---
layout: post
date: 2023-07-19 20:50:59 +0800
title: "Machine Learning & Statistical Physics"
tags: [Research]
---

This note is for the summer school of Machine Learning and statistical physics held at YNU, 2023.

Written by Shigang Ou (oushigang19@mails.ucas.ac.cn)

This note is changing rapidly and may contain errors or redundant content, including extended material not covered in the summer school.

What is a Neural Network?

2023-07-19, Huang: "Statistical physics of Neural Network"

| Network Name | Developer | Year | Introduction |
|---|---|---|---|
| Multilayer Perceptron (MLP) | Frank Rosenblatt | 1958 | A simple and widely used type of neural network that consists of an input layer, one or more hidden layers, and an output layer of artificial neurons. |
| Hopfield Network | John Hopfield | 1982 | A type of recurrent neural network that can store and retrieve patterns as stable states. It consists of a single layer of fully connected neurons with symmetric weights and binary activations. |
| Recurrent Neural Network (RNN) | David Rumelhart et al. | 1986 | A type of neural network that can process sequential data such as text, speech, and time series. It consists of a hidden layer of neurons that have recurrent connections to themselves, forming a loop. |
| Convolutional Neural Network (CNN) | Yann LeCun | 1989 | A type of feedforward neural network that can process high-dimensional data such as images, videos, and speech. It consists of multiple layers of neurons that perform convolution operations on the input, followed by pooling layers and fully connected layers. |
| Spiking Neural Network (SNN) | Eugene Izhikevich et al. | 2003 | A type of neural network that mimics the behavior of biological neurons more closely than other types. It consists of spiking neurons that communicate with each other using discrete pulses, or spikes. |

This list can go on and on, along with the history of winters and springs of AI. But how to understand the neural network in a more general way?

Some well-established theories from that history are still useful today.

MLP

MLP is defined as:

$$x \mapsto W_D \sigma_{D-1}(W_{D-1} \sigma_{D-2}(\dots \sigma_1(W_1 x))),$$

where $W_i$ is the weight matrix of the $i$-th layer, $\sigma_i$ is the activation function of the $i$-th layer, and $D$ is the depth of the network.

WHERE IS MY BIAS?
A typical layer with input $x$, weight $W$, bias $b$, and activation $\sigma$ is defined as:

$$y = \sigma(Wx + b)$$

Append $1$ to $x$ and append $b$ as an extra column of $W$ to get the new input $\tilde{x} = (x, 1)^\top$ and the new weight $\tilde{W} = (W, b)$. The output of the layer can then be rewritten as:

$$y = \sigma(\tilde{W}\tilde{x})$$

This has the same output as before, but no explicit bias term.
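A quick numerical check of the bias-absorption trick (a sketch using NumPy; the shapes and the choice of $\tanh$ are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # weight matrix
b = rng.normal(size=3)        # bias vector
x = rng.normal(size=4)        # input

sigma = np.tanh               # any elementwise activation

# Standard layer: y = sigma(Wx + b)
y1 = sigma(W @ x + b)

# Bias absorbed: append 1 to x, append b as an extra column of W
x_tilde = np.append(x, 1.0)
W_tilde = np.hstack([W, b[:, None]])
y2 = sigma(W_tilde @ x_tilde)

assert np.allclose(y1, y2)    # identical outputs, no explicit bias term
```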

This function is unreasonably effective in many tasks, but why?

Over-parameterization is the key.

https://arxiv.org/pdf/2008.06786.pdf: Triple Descent and a Multi-Scale Theory of Generalization

Image

Traditional idea: more parameters, more overfitting.

This is not the full story: $p$ is the number of parameters, $m$ is the number of training samples.

Explained in 2019 by Deep Double Descent: Where Bigger Models and More Data Hurt:

Hopfield Network

Neural Networks and Physical Systems with Emergent Collective Computational Abilities:

Hopfield network is defined as:

$$V_i(t+1) = \begin{cases} 1 & \text{if } \sum_{j=1}^N T_{ij}V_j(t) > U_i \\ 0 & \text{if } \sum_{j=1}^N T_{ij}V_j(t) < U_i \\ V_i(t) & \text{otherwise} \end{cases}$$

where $V_i(t)$ is the state of neuron $i$ at time $t$, $T_{ij}$ is the weight of the synapse from neuron $j$ to neuron $i$, and $U_i$ is the threshold value for neuron $i$.

Image

This definition mimics biological neurons, as in Fig. 1. Anything cool?

Computational ability that can generalize, categorize, correct errors, and recognize familiarity.

If we want to store $n$ binary patterns $V^s\in \{0,1\}^N$ in it, we can set the weight matrix as:

$$T_{ij} = \sum_{s=1}^n (2V_i^s - 1)(2V_j^s - 1)$$

Here $2V-1 \in \{-1, 1\}$ is the binary-to-bipolar transformation.

Neurons in the same state reinforce each other; neurons in different states inhibit each other.

Here the weight matrix is fixed, unlike in the MLP.

To retrieve a stored memory from an initial state $V(0)$, the neurons are updated randomly and asynchronously under the Hopfield network dynamics until the network reaches a stable state.

The update rule is asynchronous and random? Why not synchronous and deterministic?

The update rule is not important, the energy function is.

We can define the energy function of the network as:

$$E = -\frac{1}{2}\sum_{i,j=1}^N T_{ij}V_iV_j + \sum_{i=1}^N U_iV_i$$

The first term is the interaction energy between neurons, and the second term is the external energy of the neurons.

It's proved that the energy function is a Lyapunov function of the Hopfield network dynamics, which means that the energy function will decrease or remain constant at each step, until it reaches a minimum value that corresponds to a stable state of the network.

Proof (for an asynchronous update of a single neuron $k$, with $T_{ij}=T_{ji}$ and $T_{kk}=0$):

$$E(t+1) - E(t) = -\big(V_k(t+1) - V_k(t)\big)\Big(\sum_{j=1}^N T_{kj}V_j(t) - U_k\Big) \leq 0$$

The inequality holds because, by the update rule, $V_k(t+1)-V_k(t)$ is nonzero only when it has the same sign as the local field $\sum_j T_{kj}V_j(t) - U_k$.

Do the stable states actually correspond to the stored memories?

Not necessarily. The stable states of the Hopfield network are the local minima of the energy function, which may not correspond to the stored memories. The Ising model and spin glasses provide the answer for different $T_{ij}$.

Image

This is the MC simulation result; local minima, i.e. noise, exist. How can we reduce them?

This paper finds that clipping the weights increases the noise.

What cool application can be made out of this? In principle: MNIST digit recognition, image denoising, and even generative models.

See:

Btw, Restricted Boltzmann Machine (RBM) is a special case of Hopfield network.


Hands-on session!

For coding convenience, we can define the Hopfield network as:

A Hopfield network is a graph $G=(V, T)$, where $V$ is the set of vertices and $T$ is the set of edge weights. The state of the network is a bipolar vector $V(t)\in \{-1,1\}^n$, where $n$ is the number of vertices. The weight matrix $T$ is defined as:

$$T_{ij} = \sum_{s} V_i^sV_j^s$$

where $V_i^s$ is the $i$-th element of the $s$-th stored memory $V^s$.
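A minimal hands-on sketch of the above, in bipolar $\{-1,1\}$ convention with Hebbian weights (the network size, pattern count, and corruption level are arbitrary toy choices):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 64
patterns = rng.choice([-1, 1], size=(3, n))  # three random bipolar memories

# Hebbian weights T_ij = sum_s V_i^s V_j^s, with zero self-coupling
T = patterns.T @ patterns
np.fill_diagonal(T, 0)

def energy(v):
    """E = -1/2 sum_ij T_ij v_i v_j (thresholds U_i = 0)."""
    return -0.5 * v @ T @ v

def recall(v, steps=2000):
    """Asynchronous dynamics: align one random neuron with its local field."""
    v = v.copy()
    energies = [energy(v)]
    for _ in range(steps):
        i = rng.integers(n)
        h = T[i] @ v                 # local field on neuron i
        if h != 0:
            v[i] = np.sign(h)
        energies.append(energy(v))
    return v, energies

# Corrupt a stored memory with 10 flipped spins, then let the dynamics relax
noisy = patterns[0].copy()
noisy[rng.choice(n, size=10, replace=False)] *= -1
v, energies = recall(noisy)

# Energy never increases (Lyapunov property); the state falls back to the memory
assert all(e2 <= e1 for e1, e2 in zip(energies, energies[1:]))
print("overlap with stored pattern:", patterns[0] @ v / n)
```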

<!-- In Progress  -->

Curie-Weiss Model

The Curie-Weiss model is a mean-field model of ferromagnetism, named after Pierre Curie and Pierre Weiss. Every spin interacts equally with every other spin, so each spin feels the mean field produced by all the others. The model predicts a phase transition at the Curie temperature: above it the magnetization vanishes (and responds linearly to a small applied field), while below it a spontaneous magnetization appears.

Energy:

$$\mathcal{H} = -\frac{J}{2} \sum_{i\neq j} s_i s_j$$

This contains only one parameter, $J$.

The solution is

$$m = \tanh{\big(\beta (n-1) J m\big)}$$

where $m$ is the magnetization, $n$ is the number of particles, $J$ is the interaction strength, and $\beta$ is the inverse temperature.

Below the critical temperature, i.e. when $\beta(n-1)J > 1$, this equation has two nontrivial solutions $\pm m \neq 0$ in addition to $m=0$; above it, only $m=0$ remains.

1 parameter saves 2 data points.
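The self-consistency equation can be solved numerically by fixed-point iteration (a sketch; `beta_eff` stands in for the combination $\beta(n-1)J$):

```python
import numpy as np

def magnetization(beta_eff, m0=0.9, iters=200):
    """Solve m = tanh(beta_eff * m) by fixed-point iteration.

    beta_eff plays the role of beta*(n-1)*J in the Curie-Weiss
    self-consistency equation.
    """
    m = m0
    for _ in range(iters):
        m = np.tanh(beta_eff * m)
    return m

# Below the critical coupling (beta_eff < 1) only m = 0 survives;
# above it, a spontaneous magnetization appears.
assert abs(magnetization(0.5)) < 1e-6   # paramagnetic phase: m -> 0
assert magnetization(2.0) > 0.9         # ferromagnetic phase: m > 0
```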

Network Phase Diagram

Image

Taken from Dreaming Neural Networks: Forgetting spurious memories and reinforcing pure ones

To learn from data, the network weights $J_{ij}$ are chosen as:

$$J_{ij} = \argmin_{J_{ij}} \Big(-\frac{\beta}{m} \sum_{\mu=1}^m \sum_{i<j} x_i^\mu x_j^\mu J_{ij} + \log Z\Big)$$

Which is derived from maximum likelihood estimation.

To minimize it, we take the derivative with respect to $J_{ij}$:

$$\begin{aligned} \frac{\partial \mathcal{L}}{\partial J_{ij}} &= -\frac{\beta}{m} \sum_{\mu=1}^m x_i^\mu x_j^\mu + \frac{\partial \log Z}{\partial J_{ij}} \\ &= -\frac{\beta}{m} \sum_{\mu=1}^m x_i^\mu x_j^\mu + \beta \lang x_i x_j \rang_J \\ &= -\beta \big( \lang x_i x_j \rang_\text{data} - \lang x_i x_j \rang_J \big) \end{aligned}$$

where $\lang x_i x_j \rang_\text{data} = \frac{1}{m}\sum_\mu x_i^\mu x_j^\mu$ is the average of $x_i x_j$ over the data, and $\lang x_i x_j \rang_J$ is the average of $x_i x_j$ over the distribution defined by the weights $J_{ij}$.

This is the inverse Ising problem: we want to find $J_{ij}$ from samples of $x_i$, i.e. design a Hamiltonian to fit the data.

Phase diagrams are so cool that they have been applied to NNs, e.g. here: Image taken from Phase diagram for two-layer ReLU neural networks and infinite-width limit

The above model only considers two-body interactions, but in reality there are higher-order interactions. (Not discussed yet.)

Consider Ising model with hidden variable:

$$E = -\sum_{i, a}s_i\sigma_a J_{ia} - \sum_{i<j} s_i s_j J_{ij} - \sum_{a<b} \sigma_a \sigma_b J_{ab}$$

where $s_i$ is the spin of the $i$-th particle, $\sigma_a$ is the spin of the $a$-th hidden variable, $J_{ia}$ is the interaction strength between particle $i$ and hidden variable $a$, $J_{ij}$ is the interaction strength between particles $i$ and $j$, and $J_{ab}$ is the interaction strength between hidden variables $a$ and $b$.

But we can only observe $s_i$, so we need to marginalize over $\sigma_a$:

$$p(s) = \sum_{\sigma} p(s,\sigma) = \sum_{\sigma} \frac{e^{-\beta E(s,\sigma)}}{Z}$$

The loss is

$$\mathcal{L} = -\frac{1}{m}\sum_{\mu=1}^m \log \frac{\sum_{\sigma} e^{-\beta E(x^\mu,\sigma)}}{Z}$$

This is the same as the above.

Similar to the above, we take the derivative with respect to $J_{ia}$:

$$\begin{aligned} \frac{\partial \mathcal{L}}{\partial J_{ia}} &= -\frac{\beta}{m} \sum_{\mu=1}^m \lang x_i^\mu \sigma_a \rang_{p(\sigma|x^\mu)} + \beta \lang x_i \sigma_a \rang_J \\ &= -\beta \big( \lang x_i \sigma_a \rang_\text{data + hidden} - \lang x_i \sigma_a \rang_J \big) \end{aligned}$$

This is slow using MCMC. Hinton proposed a faster way: the RBM, a bipartite graph in which the hidden variables are conditionally independent of each other, and the visible variables are conditionally independent of each other too.

Hinton proposed Contrastive Divergence (CD) algorithm to solve this problem.

NN dynamics and its weight spectrum

Image Taken from Correlation Between Eigenvalue Spectra and Dynamics of Neural Networks

The main contributions of the paper are:

A new proposal:

Find the different phases and order parameters of NNs, and integrate existing works.

The NN can be MLP, Hopfield, RNN, CNN, SNN, etc.

One work related to the grokking behavior in NNs:

Image Taken from Towards Understanding Grokking: An Effective Theory of Representation Learning

but it does not offer any meaningful insight into the phase diagram.

Image

This so-called phase diagram/phase transition is an abuse of terms from physics, and offers limited theoretical insight. For more discussion, see the OpenReview of this paper.

A reviewer claims: there have been a couple of papers that use tools from physics and provide phase diagrams for understanding the generalization of neural networks:

Here the last paper applies Noether's theorem to NNs, yielding predictions for the training dynamics of NNs.

Hopfield Network

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC346238/pdf/pnas00447-0135.pdf: Neural Networks and Physical Systems with Emergent Collective Computational Abilities

Unsupervised Learning

Boltzmann Machine

set

$$p(x)=\frac{e^{-E(x)}}{Z}$$

where $Z=\sum_x e^{-E(x)}$ is the normalization factor. We can learn $E(x)$ from the data by minimizing

$$\mathcal{L}=-\frac{1}{|D|}\sum_{x\in D}\ln{p(x)}$$

for example, if we set $E(x)=-\frac{1}{2}\sum_{i,j}W_{ij}x_ix_j$, then all we need is the gradient of $\mathcal{L}$ w.r.t. $W_{ij}$:

$$\frac{\partial \mathcal{L}}{\partial W_{ij}} = -\frac{1}{2}\big(\lang x_ix_j \rang_{x\thicksim D} - \lang x_ix_j \rang_{x \thicksim p}\big)$$

and we can use gradient descent to minimize it

$$W_{ij}'=W_{ij} + \eta \big(\lang x_ix_j \rang_{x\thicksim D} - \lang x_ix_j \rang_{x \thicksim p}\big)$$

What makes the algorithm costly is the generation of $x\thicksim p$, which must be updated and regenerated at every step using MCMC. To deal with this, one may use samples from $D$, for example the last batch, to replace $p$.

In practice, when we train the energy function, we often take it to be of the following form

$$E(x)=-\sum_{i}a_ix_i - \sum_j\ln{\big(1 + e^{\sum_{i}W_{ij}x_i + b_j}\big)}$$

which is derived from the following hypothesis

$$\begin{aligned} p(x)&=\frac{1}{Z}\sum_h e^{-E(x,h)} &&\text{(hidden variable)}\\ E(x,h)&=-\sum_i a_ix_i -\sum_j b_j h_j - \sum_{i,j}x_iW_{ij}h_j &&\text{(energy parameterization)}\\ p(h|x)&=\prod_j p(h_j|x) &&\text{(bipartite graph)}\\ p(x|h)&=\prod_i p(x_i|h) &&\text{(bipartite graph)} \end{aligned}$$

The gradient in general would be

$$\frac{\partial \mathcal{L}}{\partial \theta} = \lang \frac{\partial E}{\partial \theta} \rang _{x\thicksim D,\, h\thicksim p(h|x)} - \lang \frac{\partial E}{\partial \theta} \rang_{x \thicksim p(x),\, h \thicksim p(h|x)}$$
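The two-term gradient above can be approximated with Hinton's CD-1: a single Gibbs step started from the data replaces the intractable model average. A minimal sketch of a binary RBM (the toy dataset, layer sizes, learning rate, and epoch count are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_vis, n_hid, lr = 6, 4, 0.1
W = 0.01 * rng.normal(size=(n_vis, n_hid))
a = np.zeros(n_vis)   # visible biases
b = np.zeros(n_hid)   # hidden biases

# Toy dataset: two repeated binary patterns
data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 0, 0, 1, 1, 1]] * 20, dtype=float)

def free_energy(v):
    """E(x) = -sum_i a_i x_i - sum_j log(1 + exp(sum_i W_ij x_i + b_j))."""
    return -v @ a - np.sum(np.log1p(np.exp(v @ W + b)), axis=-1)

for epoch in range(200):
    v0 = data
    ph0 = sigmoid(v0 @ W + b)                     # p(h|x) clamped to the data
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + a)                   # one Gibbs step back to x
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b)
    # CD-1: <v h>_data - <v h>_reconstruction approximates the true gradient
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / len(data)
    a += lr * (v0 - v1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)

# Training should assign the data patterns low free energy (high probability)
print("mean data free energy:", free_energy(data).mean())
```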

Autoencoder (automatic reconstruction)

input equals output.

Variational Autoencoder, which works as follows:

  1. encode the input into a distribution $q(z|x)$
  2. sample $z$ from $q(z|x)$
  3. decode $z$ into a distribution $p(x|z)$
  4. sample $x$ from $p(x|z)$
  5. minimize the negative log-likelihood of $x$ w.r.t. $q(z|x)$ and $p(x|z)$

GAN

set

$$\min_{G} \max_{D} \mathcal{L}(D,G) = \mathbb{E}_{x\thicksim p_{data}(x)}[\log{D(x)}] + \mathbb{E}_{z\thicksim p_{z}(z)}[\log{(1-D(G(z)))}]$$

where $G$ is the generator, $D$ is the discriminator, $p_{data}(x)$ is the data distribution, $p_z(z)$ is the noise distribution, and $G(z)$ is the generated data.

Contrastive Learning

Contrastive Learning is a general framework for unsupervised learning. The idea is to learn a representation that makes the data distribution and the noise distribution distinguishable.

Score-Based Diffusion Model

Image

Jacot et al., 2018

Statistical Physics of Learning

The glass state

Magnetic impurities in a non-magnetic metal (e.g. AuFe) lead to randomly oriented spin magnetic moments and long-range disorder.

Glass state is a non-equilibrium, non-crystalline condensed state of matter that exhibits a glass transition when heated towards the liquid state.

Edwards-Anderson model: Ising spin glass.

Sherrington-Kirkpatrick model: fully connected (infinite-range) Ising spin glass.

Replica Method

Goal: compute the free energy $F=-\frac{1}{\beta}\ln{Z}$

  1. write out $Z$: $Z=\sum_{\{s_i\}}e^{-\beta H(s_1,\dots,s_N)}$
  2. use the limit identity: $\ln{Z}=\lim_{n\to 0}\frac{Z^n-1}{n}$
  3. expand it: $Z^n=\sum_{\{s_i^a\}}e^{-\beta \sum_{a=1}^n H(s_1^a,\dots,s_N^a)}$
  4. average over the quenched (Gaussian) disorder; the average couples different replicas through terms quadratic in the spins
  5. linearize these quadratic terms with a Hubbard-Stratonovich transformation, introducing the replica overlaps $Q_{ab}$ as Gaussian integration variables
  6. evaluate the remaining integral with the Laplace (saddle-point) approximation as $N\to\infty$, then continue $n\to 0$

Replica Symmetry Breaking

The replica method is a technique to deal with quenched disorder in statistical physics, such as in spin glass models. The idea is to compute the average of the logarithm of the partition function by introducing replicas of the system and taking the limit of zero replicas. The partition function of n replicas can be written as:

$$Z^n = \int \prod_{i<j} dJ_{ij}\, P(J_{ij}) \sum_{\{\sigma\}} \exp\left(\beta \sum_a \sum_{i<j} J_{ij} \sigma_i^a \sigma_j^a + \beta h \sum_a \sum_i \sigma_i^a\right)$$

where $\sigma_i^a$ is the spin variable of the $i$-th site and the $a$-th replica, $J_{ij}$ is the random coupling between sites $i$ and $j$, $P(J_{ij})$ is the probability distribution of the couplings, $\beta$ is the inverse temperature, and $h$ is the external magnetic field.

The replica method assumes that there is a unique analytic function that interpolates the values of ZnZ^n for integer nn, and that this function can be analytically continued to real values of nn. Then, one can write:

$$\lim_{n\to 0} \frac{Z^n - 1}{n} = \log Z$$

The average free energy per spin can then be obtained by:

$$f = -\lim_{n\to 0} \frac{1}{\beta n N} \log Z^n$$

where NN is the number of spins in the system.

The main difficulty in applying the replica method is to find a suitable representation for the order parameter that describes the correlations between replicas. This order parameter is usually an $n\times n$ symmetric matrix $Q_{ab}$, where $Q_{ab}$ is the overlap between replicas $a$ and $b$, defined as:

$$Q_{ab} = \frac{1}{N} \sum_i \sigma_i^a \sigma_i^b$$

The matrix $Q_{ab}$ encodes the structure of the phase space of the system, and how it is partitioned into different states or clusters.
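The overlap matrix is easy to compute from spin configurations directly (a sketch; here the replicas are independent random configurations, so the off-diagonal overlaps are small, of order $1/\sqrt{N}$):

```python
import numpy as np

rng = np.random.default_rng(0)

N, n = 1000, 4                          # N spins, n replicas
# Toy "replicas": independent random Ising configurations
sigma = rng.choice([-1, 1], size=(n, N))

# Overlap matrix Q_ab = (1/N) sum_i sigma_i^a sigma_i^b
Q = sigma @ sigma.T / N

# By the raw definition the diagonal is exactly 1 for +-1 spins
# (the RS/RSB Ansatz then conventionally sets Q_aa = 0).
assert np.allclose(np.diag(Q), 1.0)
print(np.round(Q, 2))
```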

The simplest assumption for the matrix $Q_{ab}$ is that it is invariant under permutations of replicas, meaning that all replicas are equivalent. This is called the replica symmetric (RS) Ansatz, and it implies that:

$$Q_{aa} = 0$$

$$Q_{ab} = q, \quad a\neq b$$

where $q$ is a constant that measures the average overlap between replicas.

However, it turns out that the RS Ansatz is not valid for some systems, such as the Sherrington-Kirkpatrick (SK) model of spin glasses. In these systems, there are many metastable states that are separated by large energy barriers, and different replicas can explore different regions of phase space. This leads to a more complicated structure for the matrix $Q_{ab}$, which requires breaking replica symmetry.

Replica symmetry breaking (RSB) is a way to generalize the RS Ansatz by allowing for different values of overlaps between replicas, depending on how they are grouped or clustered. The most general form of RSB is called full RSB, and it involves an infinite hierarchy of breaking levels. However, for some systems, such as the SK model, a simpler form of RSB is sufficient, called one-step RSB (1-RSB).

In 1-RSB, one divides the $n$ replicas into $n/m$ groups of size $m$, where $m$ is a real parameter between 0 and 1. Then, one assumes that:

$$Q_{aa} = 0$$

$$Q_{ab} = q_1, \quad a,b \in \text{same group}$$

$$Q_{ab} = q_0, \quad a,b \in \text{different groups}$$

where $q_1 > q_0$ are constants that measure the intra-group and inter-group overlaps, respectively.

The 1-RSB Ansatz captures the idea that there are clusters of states that have a higher overlap within them than between them. The parameter $m$ controls how probable it is to find two replicas in the same cluster.

Using the 1-RSB Ansatz, one can compute the free energy per spin as:

$$f(q_1,q_0,m) = -\frac{\beta}{4}\big(1 + mq_0^2 + (1-m)q_1^2 - 2q_1\big) -\frac{1}{m\beta}\int Dz \log\left[\int \frac{dy}{\sqrt{2\pi(q_1-q_0)}} \exp\left(-\frac{y^2}{2(q_1-q_0)}\right) \big(2\cosh(\beta(z+y))\big)^m\right]$$

where $Dz = \frac{dz}{\sqrt{2\pi q_0}} \exp\left(-\frac{z^2}{2q_0}\right)$ is a Gaussian measure of variance $q_0$.

The self-consistency equations for q1q_1, q0q_0, and mm can be obtained by extremizing the free energy with respect to these parameters. The solution of these equations gives the correct description of the low-temperature phase of the SK model, as proven by Parisi [6].

A general procedure of the replica method is as follows:

Image

(taken from Boazbarak)

The crazy part of the replica method is that we want to calculate $\log Z$, but end up evaluating a different quantity, and the two magically agree! (Proven rigorously only for the SK model; no proof yet for other settings.)

Cavity Method and 2-layer Perceptron

Cavity Method

The cavity method is a technique to compute the average of a function of a random variable in a large system, by using a recursive equation that describes the correlations between the variable and its neighbors. It is used in statistical physics to compute the free energy of spin glass models, and in computer science to analyze the performance of message passing algorithms on random graphs.

Large deviation theory

ML algorithms don't stop at saddle points that exist in isolation, but settle in the large (although seemingly rare) basins of attraction.

Subdominant Dense Clusters Allow for Simple Learning and High Computational Performance in Neural Networks with Discrete Synapses.

Quantum Many-body Physics Introduction

Quantum Many-body Physics

Quantum many-body physics is the study of the behavior of systems made of many interacting particles, where quantum mechanics plays an essential role. It is used to describe a wide range of physical phenomena, from the behavior of electrons in metals and semiconductors, to the superfluidity of liquid helium, the Bose-Einstein condensation of ultracold atoms, and the superconductivity of certain materials at low temperatures.

Quantum Many-body Problem

The quantum many-body problem is the problem of finding the ground state of a quantum system with many interacting particles. It is one of the most challenging problems in physics, due to the exponential complexity of the Hilbert space of the system. The difficulty of the problem depends on the dimensionality of the system, the type of interactions, and the symmetries of the Hamiltonian.

Define

$$\mathcal{H} = \sum_{i=1}^N \frac{p_i^2}{2m} + \sum_{i<j} V(r_i - r_j)$$

where $r_i$ is the position of the $i$-th particle, $p_i$ is its momentum, $m$ is its mass, and $V(r_i - r_j)$ is the interaction potential between particles $i$ and $j$.

The ground state of the system is the state with the lowest energy, and it is the state that the system will tend to occupy at zero temperature.

Quantum Many-body Problem in 1D

In one dimension, the quantum many-body problem can be solved exactly using the Bethe ansatz. The Bethe ansatz is a method to find the eigenstates of a system with integrable Hamiltonian, by writing them as a linear combination of plane waves. It was first introduced by Hans Bethe in 1931 to solve the Heisenberg model of ferromagnetism, and it was later generalized to other models such as the Hubbard model of interacting electrons.


Quantum Many-body System introduction


competing order: different order parameters compete with each other.

Macroscopic quantum phenomena: superconductivity, superfluidity, Bose-Einstein condensation.

Nobel Prize related: 1972, 2003, 2016, 2017, 2019

Boson: superfluidity, Bose-Einstein condensation

Fermion: superconductivity

Exponential complexity of the Hilbert space of the system: $2^N$, where $N\thicksim 10^{23}$.

Superconductivity

The superconductivity can be defined as:

$$\begin{aligned} \text{Meissner effect:} &\quad \text{perfect diamagnetism} \\ \text{Zero resistance:} &\quad \text{perfect conductivity} \\ \text{Cooper pair:} &\quad \text{bound state of two electrons} \\ \text{BCS theory:} &\quad \text{Cooper pair condensation} \end{aligned}$$

During the writing of this note, LK-99, a material claimed to be a room-temperature superconductor, is attracting public attention.

Can theory really guide the discovery of novel materials? The answer seems to be NO so far.

Experiment facts:

Theory explaining the isotope effect: BCS theory.

BCS theory basics:

It explains the isotope effect by:

BCS Theory

Three key points:

Cooper Pair

$$E_b = -g^2(k_F)\, n^2(\omega_D)\, V^{-1}$$

where $g(k_F)$ is the coupling constant at zero momentum transfer, $n(\omega_D)$ is the Bose-Einstein distribution function for phonons at the Debye frequency $\omega_D$, and $V$ is the volume of the system.

Superfluidity

P. W. Anderson: Imagine yourself in a room full of people, each of whom is wearing a pair of roller skates. You are not wearing roller skates. You want to move across the room, but you cannot walk. What do you do? You grab onto the hand of the person nearest you, and the two of you move together across the room. Then you let go, grab onto the hand of the next person, and so on. In this way, you can move across the room without ever putting on a pair of roller skates. This is how electrons move through a superconductor.

Quantum Simulation using cold atom in optical lattice: Nature 452, 854 (2008)

Bose-Hubbard Model for Superfluidity and Mott Insulator

define:

$$\mathcal{H} = -t \sum_{\langle i,j \rangle} (b_i^\dagger b_j + b_j^\dagger b_i) + \frac{U}{2} \sum_i n_i (n_i - 1) - \mu \sum_i n_i$$

where $b_i^\dagger$ is the creation operator of a boson at site $i$, $n_i = b_i^\dagger b_i$ is the number operator at site $i$, $t$ is the hopping amplitude, $U$ is the on-site interaction strength, and $\mu$ is the chemical potential.


The ground state of the system can be either a superfluid or a Mott insulator, depending on the values of the parameters $t$, $U$, and $\mu$.

The superfluid phase is characterized by the presence of long-range phase coherence, which means that the phase of the wavefunction is the same at all sites. This allows the bosons to move freely through the lattice, and it leads to a finite superfluid density.

The Mott insulator phase is characterized by the absence of long-range phase coherence, which means that the phase of the wavefunction is different at different sites. This prevents the bosons from moving freely through the lattice, and it leads to a zero superfluid density.

Solve for the ground state of the system using the Gutzwiller ansatz, which is a variational ansatz that assumes that the ground state is a product state of the form:

$$\ket{\Psi} = \prod_i \ket{\psi_i}$$

where $\ket{\psi_i}$ is the state of site $i$.

The Gutzwiller ansatz is exact in the $t=0$ Mott-insulator limit, and it gives a good approximation for the superfluid phase.
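A minimal self-consistent mean-field sketch of this phase competition: the hopping term is decoupled as $b_i^\dagger b_j \to \phi\, b_j + b_i^\dagger \phi - \phi^2$ with the superfluid order parameter $\phi = \langle b \rangle$ (taken real). The coordination number `z`, the occupation cutoff `nmax`, and the parameter values are my own illustrative choices:

```python
import numpy as np

def gutzwiller_phi(t, U, mu, z=4, nmax=6, iters=500):
    """Self-consistent mean field for the Bose-Hubbard model.

    Iterates the single-site problem H = H0 - z*t*phi*(b + b^dag)
    until phi = <b> is self-consistent. phi > 0 signals the
    superfluid phase; phi = 0 the Mott insulator.
    """
    n = np.arange(nmax + 1)
    b = np.diag(np.sqrt(n[1:]), k=1)              # annihilation operator
    H0 = np.diag(0.5 * U * n * (n - 1) - mu * n)  # interaction + chemical potential
    phi = 0.1                                      # initial guess
    for _ in range(iters):
        H = H0 - z * t * phi * (b + b.T)
        _, vecs = np.linalg.eigh(H)
        g = vecs[:, 0]                             # single-site ground state
        phi = g @ b @ g                            # update phi = <b>
    return abs(phi)

# Deep in the n=1 Mott lobe (t << U): phi -> 0 (insulator)
assert gutzwiller_phi(t=0.01, U=1.0, mu=0.5) < 1e-3
# Large hopping: phi > 0 (superfluid)
assert gutzwiller_phi(t=0.2, U=1.0, mu=0.5) > 0.1
```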

Spin Glass: Theory and applications

Setting

$N$ elements, each with a property $\sigma_i$ and an energy $E_i(\sigma_i)$.

The elements interact with one another.

Connections between ML and SP

2023-07-23 09:11:11 Pan Zhang, SP, ML, QC

ML is "essentially" fitting the joint probability distribution of the data.

Simulation methods in one field can be used in the other fields.

Does nature essentially compute?

Is the universe a computer?

Computation is defined as the process of transforming information from one form to another. It is a fundamental process that occurs in nature, and it is the basis of all physical phenomena.

Can we mimic nature's computation?

Can we harness nature's computation?

Why are there different phases and phase transitions?

Free energy: Consider all possible configurations.

At low T, liquid phase is possible, but with vanishing probability.

Nature chooses the phase with the lowest free energy.

Will nature fail?

Super-cooled liquid: a liquid that persists below the freezing point, even though the solid phase has lower free energy.

It might rest at some local minimum.

OK, so sometimes there is no solid, but a super-cooled liquid or a glass. (This sounds insane, but is reasonable, and resolved part of a puzzle for me.)

More is different: exponential complexity space, but only a few phases.

Statistical mechanics from theoretical computer science perspective: computational complexity.

| Task | Computational complexity |
|---|---|
| Partition function | #P-complete |
| Free energy | #P-complete |
| Observable | NP-complete |
| Sampling | NP-complete |
| Optimization | NP-complete |
| Energy of one configuration | P |

Complexity ladders: P, NP, #P, PSPACE, BQP, BPP, ...

| Complexity class | Problem | Example |
|---|---|---|
| P | Polynomial | Matrix multiplication |
| NP | Non-deterministic polynomial | Traveling salesman problem |
| #P | Counting problem | Partition function |
| PSPACE | Polynomial space | Quantum circuit simulation |

Statistical Physics and Statistical Inference are the same thing

Given a data set $D=\{x\}$, where each sample $x\in D$ has a label $y$: if we can learn the joint distribution $p(x,y)$ from the dataset and generate unseen samples conditioned on a label, we call it a Generative Model.

Learn: given data $x$ and labels $y$, learn $p(x,y)$

Usage:

  1. discriminative task: $p(y|x)=p(x,y)/p(x)$
  2. generative task: $p(x|y)=p(x,y)/p(y)$

This seems to be a trivial Bayesian estimation, but problems arise when we deal with high-dimensional distributions (i.e., $\dim{x}\gg 1$), since we need to fit a high-dimensional curve (which leads to the curse of dimensionality).

To deal with this, we introduce some models that give a prior distribution function p(x)p(x) and learn the parameters to obtain the correct distribution.

This may seem a trivial practice, as it makes no difference from simply using an NN to represent $p(x,y)$.

The loss function is the difference between $p(x)$ and the sample distribution $\pi(x)$; we minimize

$$\mathbb{KL}(\pi||p)=\sum_{x\in D}\pi(x)\ln{\Big[\frac{\pi(x)}{p(x)}\Big]} = \lang \ln{\frac{\pi}{p}} \rang_\pi$$

we may simplify it as

$$\mathbb{KL}(\pi||p)=\lang \ln{\frac{\pi}{p}} \rang_\pi =\lang \ln{\pi} \rang_\pi - \lang \ln{p} \rang_\pi$$

in most cases, we take $\pi(x)=\frac{1}{|D|}\sum_{x'\in D}\delta(x-x')$, and all we need is to minimize $- \lang \ln{p} \rang_\pi$, which simplifies to

$$\mathcal{L}=- \lang \ln{p} \rang_\pi=-\frac{1}{|D|}\sum_{x\in D}\ln{p(x)}$$

this is the Negative Log-Likelihood; to minimize it is to maximize the likelihood.
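A quick numerical check that minimizing the NLL is the same as minimizing $\mathbb{KL}(\pi||p)$, since the two differ only by the (model-independent) entropy of $\pi$ (the toy discrete distribution is my own choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Empirical data from a discrete distribution over 3 symbols
data = rng.choice(3, size=1000, p=[0.6, 0.3, 0.1])
pi = np.bincount(data, minlength=3) / len(data)   # empirical distribution pi(x)

def nll(p):
    """Negative log-likelihood of the data under model p."""
    return -np.mean(np.log(p[data]))

def kl(pi, p):
    return np.sum(pi * np.log(pi / p))

p_good = np.array([0.6, 0.3, 0.1])
p_bad = np.array([1/3, 1/3, 1/3])

# NLL = KL(pi||p) + H(pi): they differ by a constant independent of the model
for p in (p_good, p_bad):
    assert np.isclose(nll(p), kl(pi, p) - np.sum(pi * np.log(pi)))

# So minimizing NLL is minimizing KL: the matching model scores better
assert nll(p_good) < nll(p_bad)
```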


Another way to understand:

Given data x\bold{x}, we want to learn the distribution p(x)p(\bold{x}).

Define it as a partition function:

$$Z=p(\bold{x})=\sum_{\bold{s}} p(\bold{x}|\bold{s})p_0(\bold{s})$$

where $\bold{s}$ is the hidden variable, $p_0(\bold{s})$ is the prior distribution of $\bold{s}$, and $p(\bold{x}|\bold{s})$ is the conditional distribution of $\bold{x}$ given $\bold{s}$.

Given $p(\bold{x}|\bold{s})$ and $p_0(\bold{s})$, we can compute $p(\bold{x})$, and then the posterior $p(\bold{s}|\bold{x})$:

$$p(\bold{s}|\bold{x})=\frac{p(\bold{x},\bold{s})}{p(\bold{x})}=\frac{p(\bold{x}|\bold{s})p_0(\bold{s})}{p(\bold{x})}$$

Information Compression

Given $y=f(x)$, we want to restore $x$ from $y$, where $f(x)=Fx$ and $F$ is an $m \times n$ matrix with $m<n$.

In principle, we cannot restore $x$ from $y$, but we can find an $x'$ that is close to $x$. And if $x$ is sparse, given enough measurements $y$, we can recover $x$.
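A minimal sparse-recovery sketch via ISTA (iterative soft thresholding), which solves the $\ell_1$-relaxed problem $\min_x \frac{1}{2}\|y-Fx\|^2 + \lambda\|x\|_1$; the dimensions and the regularization strength `lam` are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, k = 100, 50, 5                   # signal dim, measurements, sparsity
F = rng.normal(size=(m, n)) / np.sqrt(m)
x = np.zeros(n)
x[rng.choice(n, size=k, replace=False)] = rng.normal(size=k)
y = F @ x                              # compressed measurements, m < n

def ista(F, y, lam=0.01, iters=2000):
    """Iterative soft-thresholding for min_x 0.5*||y - Fx||^2 + lam*||x||_1."""
    L = np.linalg.norm(F, 2) ** 2      # Lipschitz constant of the gradient
    xh = np.zeros(F.shape[1])
    for _ in range(iters):
        g = xh - F.T @ (F @ xh - y) / L           # gradient step
        xh = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # shrinkage
    return xh

xh = ista(F, y)
print("relative error:", np.linalg.norm(xh - x) / np.linalg.norm(x))
```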

A phase diagram of the problem:

Image

Taken from Krzakala et al. 2012

How is this diagram related to physics?

Why phase diagram plays central role in physics?

Why NNs?

There are many methods more parameter-efficient than NNs, but NNs are the most suitable for GPUs.

魑魅魍魉 Chi Mei Wang Liang

The difficulty of Generative > Discriminative

What I cannot Create, I do not Understand

Quantum Mechanics wants to find p(x)=\frac{|\Psi(x)|^2}{\int dx |\Psi(x)|^2}

Statistical Physics wants to find p(x)=\frac{e^{-\beta E(x)}}{\int dx e^{-\beta E(x)}}

Direct parameterization of \Psi(x) on quantum hardware is not attainable, so we use Tensor Networks to approximate it.

Trivial case for RBM.

Generalization ability matters. How does this overcome the problem from physics?

Seemingly no solution yet.

Physicists' attack on Learning and Inference: An Introduction to Learning and Generalisation

GPU is all you need: Current Boom in ML is due to GPU. NNs are not the best, but the most suitable for GPU.

Success of ML = Data + Algorithms + NNs + GPU

Given that computing hardware is so important, what is the current solution to the failure of Moore's law?

"Nature isn't classical, dammit, and if you want to make a simulation of nature, you'd better make it quantum mechanical, and by golly it's a wonderful problem, because it doesn't look so easy." - Richard Feynman

"The brain is a computer made of meat." - Marvin Minsky

"The future of computing is light." - John Hennessy

"DNA is like a computer program but far, far more advanced than any software ever created." - Bill Gates

The above are analog computing approaches, which have inherent problems with noise and precision. But the future of AGI might be analog computing.

Quantum Supremacy

Quantum Supremacy is the demonstration of a quantum computer performing a calculation that is beyond the reach of the most powerful supercomputers today.

Ising model with complex-number interaction.

Google: 10,000 years(Summit)-> 200 seconds(Sycamore)

The quantum state in the random circuit sampling problem is a superposition of all possible states:

\ket{\psi} = \frac{1}{\sqrt{2^n}} \sum_{x \in \{0,1\}^n} \ket{x}

where n is the number of qubits, and \ket{x} is the computational basis state with binary representation x.

We apply random single-qubit gates to the initial state, and we measure the final state in the computational basis. The probability of obtaining a given outcome x is:

p(x) = |\bra{x} U \ket{\psi}|^2

where U is the unitary operator that represents the random circuit.

In the paper Quantum Supremacy using a programmable superconducting processor, the circuit has 53 qubits, so a direct classical simulation must handle a state vector of 2^{53} complex amplitudes (on the order of 2^{53} \times 16 bytes in double precision) in storage and computation.

Tensor networks: solve the same problem on classical hardware.

Statistical Physics of Learning

How can the Ising model be used for learning?

The Ising model is defined as:

\mathcal{H} = -\sum_{i<j} J_{ij} s_i s_j

where s_i is the spin of the i-th particle, and J_{ij} is the interaction strength between particles i and j.

The energy function must be computable in polynomial time; otherwise this model won't work. (But why?)

It satisfies the Boltzmann distribution:

p(s) = \frac{e^{-\beta \mathcal{H}(s)}}{Z}

where \beta is the inverse temperature, and Z is the partition function.
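A minimal Metropolis sampler for this Boltzmann distribution; the system size, couplings, and temperature below are illustrative:

```python
import numpy as np

# Metropolis sampling of p(s) ~ exp(-beta * H(s)) for a small
# fully connected Ising model with random symmetric couplings.
rng = np.random.default_rng(1)
n, beta = 10, 0.5
J = rng.normal(size=(n, n)) / n
J = (J + J.T) / 2
np.fill_diagonal(J, 0.0)

def energy(s):
    return -0.5 * s @ J @ s          # H = -sum_{i<j} J_ij s_i s_j

s = rng.choice([-1, 1], size=n)
for _ in range(5000):
    i = rng.integers(n)
    dE = 2.0 * s[i] * (J[i] @ s)     # energy change from flipping s_i
    if rng.random() < np.exp(-beta * dE):
        s[i] *= -1                   # accept the flip
```

The incremental dE avoids recomputing the full energy at every step; after enough sweeps, configurations are distributed according to the Boltzmann weight.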

Why the Boltzmann distribution?

Consider m sampled microstates of n spins.

We want to maximize the entropy:

S = -\sum_{s} p(s) \ln{p(s)}

subject to the constraints that the model moments match the sample averages:

\sum_{s} s_i p(s) = \lang s_i \rang_\text{data},\qquad \sum_{s} s_i s_j p(s) = \lang s_i s_j \rang_\text{data}

where the data averages are taken over the m samples.

Using Lagrange multipliers, we have:

\mathcal{L} = -\sum_{s} p(s) \ln{p(s)} + \alpha \Big(\sum_{s} p(s) - 1\Big) + \sum_{i} \beta_i \Big(\sum_{s} s_i p(s) - \lang s_i \rang\Big) + \sum_{i<j} \beta_{ij} \Big(\sum_{s} s_i s_j p(s) - \lang s_i s_j \rang\Big)

where \alpha, \beta_i, and \beta_{ij} are the Lagrange multipliers.

To solve for p(s), we take the derivative of \mathcal{L} with respect to p(s) and set it to zero:

\frac{\partial \mathcal{L}}{\partial p(s)} = -\ln{p(s)} - 1 + \alpha + \sum_{i} \beta_i s_i + \sum_{i<j} \beta_{ij} s_i s_j = 0

which gives:

p(s) = \frac{e^{\sum_{i} \beta_i s_i + \sum_{i<j} \beta_{ij} s_i s_j}}{Z}

where the normalization constant Z = e^{1-\alpha} is the partition function.

The Lagrange multipliers \beta_i and \beta_{ij} are then fixed by imposing the moment constraints; in general they have no closed form and must be solved for numerically, which is exactly the inverse Ising problem discussed below.

That's why we use the Boltzmann distribution: it is the maximum-entropy distribution consistent with the observed first and second moments.

Curie-Weiss Model

The Curie-Weiss model is a mean-field model of ferromagnetism, named after Pierre Curie and Pierre Weiss. Every spin interacts equally with every other spin, and the model predicts a spontaneous magnetization below the Curie temperature; above it, the magnetic moments are disordered and the magnetization is proportional to the applied magnetic field.

Energy:

H=12i<jJsisj\mathcal{H} = -\frac{1}{2} \sum_{i<j} Js_i s_j

This contains only 1 parameter, JJ.

The solution is

m=tanh(β(n1)Jm)m = \tanh{(\beta (n-1) J m)}

where mm is the magnetization, nn is the number of particles, JJ is the interaction strength, and β\beta is the inverse temperature.

This equation has two solutions:

1 parameter saves 2 data points.
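A quick numeric check of the self-consistency equation by fixed-point iteration; the parameter values below are illustrative:

```python
import numpy as np

# Iterate m <- tanh(beta*(n-1)*J*m) until it settles on a fixed point.
def solve_m(beta, J, n, m0=0.9, iters=200):
    m = m0
    for _ in range(iters):
        m = np.tanh(beta * (n - 1) * J * m)
    return m

# Above the critical point beta*(n-1)*J > 1 a nonzero solution appears;
# below it, the iteration collapses to m = 0.
m_ordered = solve_m(beta=2.0, J=1.0, n=2)      # beta*(n-1)*J = 2
m_disordered = solve_m(beta=0.5, J=1.0, n=2)   # beta*(n-1)*J = 0.5
```

Starting from -m0 instead of m0 converges to the mirror solution -m, which is the pair of solutions mentioned above.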

<!-- why? -->

Network Phase Diagram


Taken from Dreaming Neural Networks: Forgetting spurious memories and reinforcing pure ones

To learn from data, the network weights J_{ij} are chosen as:

J_{ij} = \argmin_{J_{ij}} \Big(-\beta \sum_{\mu=1}^m \sum_{i<j} x_i^\mu x_j^\mu J_{ij} + m \log Z\Big)

which is derived from maximum likelihood estimation.

To minimize it, we take the derivative with respect to J_{ij}:

\frac{\partial \mathcal{L}}{\partial J_{ij}} = -\beta \sum_{\mu=1}^m x_i^\mu x_j^\mu + m \frac{\partial \log Z}{\partial J_{ij}} = -\beta m ( \lang x_i x_j \rang_\text{data} - \lang x_i x_j \rang_J)

where \lang x_i x_j \rang_\text{data} is the average of x_i x_j over the data, and \lang x_i x_j \rang_J is the average of x_i x_j over the distribution defined by the weights J_{ij}.

This is the inverse Ising problem: we want to find J_{ij} from samples of x_i, i.e. design a Hamiltonian to fit the data.
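A toy sketch of the resulting learning rule on a system small enough to compute \lang x_i x_j \rang_J by exact enumeration; here the "data" correlations are generated from a known coupling matrix, which is an assumption made purely for illustration:

```python
import itertools
import numpy as np

n = 4
rng = np.random.default_rng(2)
J_true = np.triu(rng.normal(scale=0.5, size=(n, n)), 1)

# All 2^n spin configurations of the small system.
states = np.array(list(itertools.product([-1, 1], repeat=n)))

def model_correlations(J):
    """Exact <x_i x_j> under the Boltzmann distribution (beta = 1)."""
    E = -np.einsum('si,ij,sj->s', states, J, states)  # H = -sum J_ij s_i s_j
    w = np.exp(-E)
    w /= w.sum()
    return (states.T * w) @ states

C_data = model_correlations(J_true)   # stand-in for data averages

# Gradient ascent on the likelihood: dJ ~ <xx>_data - <xx>_J.
J = np.zeros((n, n))
for _ in range(2000):
    C_model = model_correlations(np.triu(J, 1))
    J += 0.2 * (C_data - C_model)
J = np.triu(J, 1)
C_fit = model_correlations(J)
```

Learning stops exactly when the model correlations match the data correlations, which is the moment-matching condition derived above.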

Phase diagrams have also been applied to NNs, e.g.: Taken from Phase diagram for two-layer ReLU neural networks and infinite-width limit

The above model only considers two-body interactions, but in reality there are higher-order interactions. (Not discussed yet)

Consider the Ising model with hidden variables:

E = \sum_{i, a}s_i\sigma_a J_{ia} + \sum_{i<j} s_i s_j J_{ij} + \sum_{a<b} \sigma_a \sigma_b J_{ab}

where s_i is the spin of the i-th visible particle, \sigma_a is the spin of the a-th hidden variable, J_{ia} is the interaction strength between particle i and hidden variable a, J_{ij} is the interaction strength between particles i and j, and J_{ab} is the interaction strength between hidden variables a and b.

But we can only observe s_i, so we marginalize over \sigma_a:

p(s) = \sum_{\sigma} p(s,\sigma) = \sum_{\sigma} \frac{e^{-\beta E(s,\sigma)}}{Z}

The loss is the negative log-likelihood:

\mathcal{L} = -\sum_{\mu=1}^m \log \Big[\sum_{\sigma} e^{-\beta E(x^\mu,\sigma)}/Z\Big]

This has the same structure as above.

Similarly, we take the derivative with respect to J_{ia}:

\frac{\partial \mathcal{L}}{\partial J_{ia}} = -\beta ( \lang x_i \sigma_a \rang_\text{data + hidden} - \lang x_i \sigma_a \rang_J)

This is slow using MCMC. Hinton proposed a faster model: the RBM (Restricted Boltzmann Machine), a bipartite graph in which the hidden variables are conditionally independent of each other given the visible variables, and vice versa.

Hinton proposed the Contrastive Divergence (CD) algorithm to train it.
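A minimal sketch of one CD-1 update for a binary RBM without bias terms; the layer sizes, learning rate, and random mini-batch below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n_v, n_h, lr = 6, 3, 0.1
W = 0.01 * rng.normal(size=(n_v, n_h))   # visible-hidden couplings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v0 = rng.integers(0, 2, size=(8, n_v)).astype(float)  # mini-batch of data

# Positive phase: hidden activations driven by the data.
ph0 = sigmoid(v0 @ W)
h0 = (rng.random(ph0.shape) < ph0).astype(float)
# Negative phase: one Gibbs step back down to the visibles and up again.
pv1 = sigmoid(h0 @ W.T)
v1 = (rng.random(pv1.shape) < pv1).astype(float)
ph1 = sigmoid(v1 @ W)

# CD-1 update: data correlations minus one-step reconstruction correlations.
W += lr * (v0.T @ ph0 - v1.T @ ph1) / len(v0)
```

The bipartite structure is what makes both conditional updates a single matrix multiply instead of a long MCMC chain.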

Implicit vs Explicit Generative Models

Given a parameterized distribution p_\theta(x), if we can explicitly compute the probability of a given sample x, it is called an explicit generative model; otherwise it is called an implicit generative model.

Is the RBM explicit or implicit?

Since p_\theta(x) = \frac{e^{-\beta E(x)}}{Z} and the partition function Z is intractable, we cannot explicitly compute the probability of a given sample x, so it is an implicit generative model.

Is the Flow model explicit or implicit?

Since p_\theta(x) = p_0(z) \Big|\det \frac{\partial z}{\partial x}\Big| with z = f^{-1}(x) computable by inverting the flow, we can explicitly compute the probability of a given sample x, so it is an explicit generative model.

Is the VAE explicit or implicit?

Since p_\theta(x) = \int dz\, p_\theta(x|z) p_0(z) involves an intractable integral, we cannot explicitly compute the probability of a given sample x, so it is an implicit generative model.

Is the Autoregressive model explicit or implicit?

Since p_\theta(x) = \prod_{i=1}^n p_\theta(x_i|x_{<i}), we can explicitly compute the probability of a given sample x, so it is an explicit generative model.

GPT uses an autoregressive model, so it is an explicit generative model.
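A toy illustration of why the autoregressive factorization is explicit: log p(x) is just a sum of conditional log-probabilities. The bigram table and start token below are made-up examples:

```python
import numpy as np

# cond[prev][next] = p(next | prev); "<s>" is a hypothetical start token.
cond = {"<s>": {"a": 0.6, "b": 0.4},
        "a":   {"a": 0.1, "b": 0.9},
        "b":   {"a": 0.5, "b": 0.5}}

def log_likelihood(seq):
    """Explicit log p(x) = sum_i log p(x_i | x_{<i}) for a bigram model."""
    logp, prev = 0.0, "<s>"
    for tok in seq:
        logp += np.log(cond[prev][tok])
        prev = tok
    return logp

ll = log_likelihood(["a", "b", "b"])   # log(0.6) + log(0.9) + log(0.5)
```

No partition function appears anywhere: each factor is already normalized, which is the whole point of the autoregressive construction.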

Should the parameter be shared within the model?


Hong Zhao, XMU, 2023-07-24 09:00:21

"Statistical Physics is not of much use in Machine Learning; Statistical Physics maximizes the entropy, but ML minimizes it."

Serious?

Statistical Physics or Statistics?

Main topics in Statistical Physics:

Statistical Physics is yet to be fully developed compared to classical mechanics. There are no clear boundaries of application between these theories.

No physics theory can describe most of the world we actually see. That's why ML rises.

Flowing water, the clouds in the sky, the blowing wind: everything around us is in a state far from equilibrium.

Statistics have the following topics:

Formal Definition of Statistical Physics

Given a point x = (x_1, x_2, \dots, x_n, p_1, p_2, \dots, p_n) in phase space \Gamma, where x_i is the coordinate of the i-th particle, p_i is the momentum of the i-th particle, and n is the number of particles.

(Micro-canonical ensemble) The probability of a given microstate x is:

p(x) = \frac{1}{\Omega(E)} \delta(E - E(x))

where \Omega(E) is the number of microstates with energy E, and \delta(E - E(x)) is the Dirac delta function.

(Canonical ensemble) The probability of a given microstate x is:

p(x) = \frac{1}{Z} e^{-\beta E(x)}

where Z is the partition function, \beta is the inverse temperature, and E(x) is the energy of the microstate x.

(Grand canonical ensemble) The probability of a given microstate x is:

p(x) = \frac{1}{\Xi} e^{-\beta E(x)} e^{-\alpha N(x)}

where \Xi is the grand partition function, \alpha is related to the chemical potential (\alpha = -\beta\mu), E(x) is the energy of the microstate x, and N(x) is the number of particles in the microstate x.

(Bose-Einstein statistics) The mean occupation number of the i-th single-particle level is:

\bar{n}_i = \frac{1}{e^{\beta (E_i - \mu)} - 1}

(Fermi-Dirac statistics) The mean occupation number of the i-th single-particle level is:

\bar{n}_i = \frac{1}{e^{\beta (E_i - \mu)} + 1}

where \beta is the inverse temperature, \mu is the chemical potential, and E_i is the energy of the i-th level.

Now we have a distribution over \Gamma, but what we measure is an average over time t. How do we resolve this?

Ergodicity: time average = ensemble average

Ergodicity hypothesis: the time average of a physical quantity is equal to its ensemble average.

\lim_{t \to \infty} \frac{1}{t} \int_0^t A(x(t')) dt' = \lang A \rang

where A(x(t')) is the value of the physical quantity A at time t', and \lang A \rang is the ensemble average of the physical quantity A.

Problem related to this hypothesis:

Difference from statistics:

Why does statistical physics have no mechanics in it? (i.e., no dynamical equation)

Liouville's theorem: the phase space volume is conserved under Hamiltonian dynamics.

But it cannot provide a proof of ergodicity or the equipartition theorem.

Consider noisy dynamics?

The "Random School" developed by Einstein, Smoluchowski, Langevin, Fokker, Planck, Kolmogorov, etc.

Boltzmann equation:

\frac{\partial f}{\partial t} + \mathbf{v} \cdot \nabla f = \mathcal{C}[f]

where f is the probability distribution function, \mathbf{v} is the velocity, and \mathcal{C}[f] is the collision operator.

Learning Machine

If the output can feed back into the system, then it is a dynamical system.

Algorithm:

When to use linear and when non-linear?

Can all data be made linearly separable in high dimension?

A linear transformation cannot increase the dimension spanned by the data, but a non-linear one can.

Occam's Razor

如无必要,勿增实体 ("Entities must not be multiplied beyond necessity.")

Can we identify the shape of a drum membrane given the sound we hear?

Mathematical proof: No

NN dynamics and its weight spectrum

Taken from Correlation Between Eigenvalue Spectra and Dynamics of Neural Networks

The main contributions of the paper are:

Other Model-free methods

Which one is the best?

No comparison found in the literature.

Dynamic Systems: A Prior

Lorenz system:

\frac{dx}{dt} = \sigma (y - x)\\ \frac{dy}{dt} = x (\rho - z) - y\\ \frac{dz}{dt} = xy - \beta z

where x, y, and z are the state variables, \sigma, \rho, and \beta are the parameters, and t is the time.

It exhibits chaotic behavior for \sigma = 10, \rho = 28, and \beta = 8/3.
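A minimal integration of the Lorenz system at these parameters; the RK4 step size and initial condition are illustrative choices:

```python
import numpy as np

sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0

def f(s):
    x, y, z = s
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def rk4_step(s, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(s)
    k2 = f(s + 0.5 * dt * k1)
    k3 = f(s + 0.5 * dt * k2)
    k4 = f(s + dt * k3)
    return s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

s = np.array([1.0, 1.0, 1.0])
traj = [s]
for _ in range(5000):
    s = rk4_step(s, 0.01)
    traj.append(s)
traj = np.array(traj)   # shape (5001, 3): the butterfly attractor
```

Plotting traj[:, 0] against traj[:, 2] reproduces the familiar two-lobed attractor.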

Phase space reconstruction:

\mathbf{x}_i = (x_i, x_{i-\tau}, x_{i-2\tau}, \dots, x_{i-(m-1)\tau})

where x_i is the i-th data point, \tau is the delay, and m is the embedding dimension.

Benchmarks are the key to a solid foundation.

Does phase space reconstruction apply to problems with only data and no model?

Yes, because it's a model-free method. It can predict the future of the system.
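A minimal delay-embedding sketch; it is written here with forward delays x_{i+k\tau}, which is equivalent to the backward convention above up to re-indexing, and the values of \tau and m are illustrative:

```python
import numpy as np

def delay_embed(x, tau, m):
    """Stack delayed copies of a scalar series into m-dimensional vectors."""
    n = len(x) - (m - 1) * tau
    return np.array([[x[i + k * tau] for k in range(m)] for i in range(n)])

x = np.sin(0.1 * np.arange(200))        # a scalar observable
X = delay_embed(x, tau=10, m=3)          # rows: (x_i, x_{i+tau}, x_{i+2tau})
```

Each row of X is one reconstructed phase-space point; for a periodic signal the rows trace out a closed loop, while for chaotic data they unfold the attractor.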

Reservoir Computing

Reservoir computing is a machine learning framework for training recurrent neural networks. It was introduced around 2001-2002, unifying the echo state network approach of Herbert Jaeger and the liquid state machine approach of Wolfgang Maass.

Theory:

\mathbf{r}(t) = \tanh{(\mathbf{W}^\text{in} \mathbf{u}(t) + \mathbf{W} \mathbf{r}(t-1))}

where \mathbf{r}(t) is the state vector at time t, \mathbf{W}^\text{in} is the input weight matrix, \mathbf{u}(t) is the input vector at time t, \mathbf{W} is the recurrent weight matrix, and \tanh is the hyperbolic tangent function.

Training procedure:

Only the output weights \mathbf{W}^\text{out} are trained, in closed form by ridge regression:

\mathbf{W}^\text{out} = \mathbf{Y} \mathbf{R}^T (\mathbf{R} \mathbf{R}^T + \lambda \mathbf{I})^{-1}

where \mathbf{R} is the state matrix, \mathbf{Y} is the target output matrix, \lambda is the regularization parameter, and \mathbf{I} is the identity matrix.

The error function E is:

E = \frac{1}{2} \sum_{t=1}^T \|\mathbf{y}(t) - \mathbf{Y}(t)\|^2

where T is the number of time steps, \mathbf{y}(t) is the output vector at time t, and \mathbf{Y}(t) is the target output vector at time t.

What is ridge regression?

Ridge regression is a regularized version of linear regression. It is used to prevent overfitting in linear regression.
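A sketch of the closed-form ridge solution used for the reservoir readout, W_out = Y R^T (R R^T + \lambda I)^{-1}; the synthetic states and targets below are illustrative stand-ins for a real reservoir run:

```python
import numpy as np

rng = np.random.default_rng(4)
n_state, n_out, T = 20, 2, 500
R = rng.normal(size=(n_state, T))       # state matrix, one column per time step
W_true = rng.normal(size=(n_out, n_state))
Y = W_true @ R                          # targets (noiseless for this sketch)

lam = 1e-6                              # regularization strength
W_out = Y @ R.T @ np.linalg.inv(R @ R.T + lam * np.eye(n_state))
```

With noiseless targets and tiny \lambda this recovers the generating weights almost exactly; with noisy data, a larger \lambda trades bias for stability of the inverse.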

Systematic Prediction based on Periodic Orbits

Monte Carlo Methods

Jiao Wang, XMU, 2023-07-24 13:01:59

Quantum Annealing

Formulation of quantum annealing

Quantum annealing is a metaheuristic for finding the global minimum of a given objective function over a given set of candidate solutions (candidate states), by a process using quantum fluctuations. Quantum annealing is used mainly for problems where the search space is discrete (combinatorial optimization problems) with many local minima; such as finding the ground state of a spin glass.

Transverse field Ising model:

\mathcal{H} = - \frac{A(s)}{2} \sum_{i=1}^n \sigma_i^x - \frac{B(s)}{2} \sum_{i<j} J_{ij} \sigma_i^z \sigma_j^z

where A(s) is the transverse-field schedule, B(s) is the problem-Hamiltonian schedule, \sigma_i^x and \sigma_i^z are the Pauli-X and Pauli-Z operators on the i-th qubit, J_{ij} is the interaction strength between the i-th and j-th qubits, and n is the number of qubits.

The evolution of the system interpolates between the two limits:

\mathcal{H}(0) = - \frac{A(0)}{2} \sum_{i=1}^n \sigma_i^x \to \mathcal{H}(1) = - \frac{B(1)}{2} \sum_{i<j} J_{ij} \sigma_i^z \sigma_j^z

where \mathcal{H}(0) is the Hamiltonian at s=0, and \mathcal{H}(1) is the Hamiltonian at s=1.

An example problem: the phase transition of the Ising model

density of defects \rho vs the annealing time t_s.

How do we obtain this relationship \rho(t_s)?

Kibble-Zurek mechanism:

The Kibble-Zurek mechanism (KZM) describes the non-equilibrium dynamics of a system undergoing a continuous phase transition. It was proposed by T. W. B. Kibble in 1976 and extended by Wojciech Zurek in 1985. The KZM is based on the idea that when a system is driven through a continuous phase transition at a finite rate, it cannot remain in equilibrium near the critical point, where the relaxation time diverges. This allows one to estimate the density of topological defects in the system as a function of the rate of the transition.

\rho(t_s) \sim \Big(\frac{t_s}{t_Q}\Big)^{-\frac{1}{2}}

where t_Q is a quantum critical time scale.

The theory predicts

\log \rho = \log\Big(\frac{1}{2\pi}\sqrt{\frac{h}{2J}}\Big) - \frac{1}{2} \log t_s

where h is the transverse field, and J is the interaction strength.

This theory has no free fitting parameter, yet it agrees well with the experimental data.

This is the first case showing that quantum annealing can be used to solve a problem that is not efficiently solvable by classical computers, using 5300 qubits.

Q: If we include long-range interactions, will it be better?

A: Maybe; no verification yet.

Q: They reached L=2000 noise-free qubits, but only under a 50\text{ns} annealing time. (Environment noise is the main problem.)

Wikipedia is doing a great job of democratizing knowledge.

Dynamics of cell state transition

Jianhua Xing, University of Pittsburgh

Mathematical considerations:

\frac{d\vec{z}}{dt} = \bold{A}(\vec{z}) + \xi(\vec{z},t)

Why does it look like this?

\bold{A}(\vec{z}) is the deterministic part, and \xi(\vec{z},t) is the stochastic part.

Here \vec{z} is a state vector specifying the complete internal state of the cell at a given time.

The problem becomes how to find \bold{A}(\vec{z}) and \xi(\vec{z},t). This can be posed as a variational problem:

\min_{\bold{A}(\vec{z}), \xi(\vec{z},t)} \Big(\frac{1}{2} \int_0^T \Big\|\frac{d\vec{z}}{dt} - \bold{A}(\vec{z}) - \xi(\vec{z},t)\Big\|^2 dt + \frac{\lambda}{2} \int_0^T \|\bold{A}(\vec{z})\|^2 dt\Big)

where \lambda is the regularization parameter.

We parameterize \bold{A}(\vec{z}) as:

\bold{A}(\vec{z}) = \sum_{i=1}^n \bold{A_i} \phi_i(\vec{z})

where \phi_i(\vec{z}) is the i-th basis function and \bold{A_i} is the i-th coefficient, with Gaussian basis functions

\phi_i(\vec{z}) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} (\vec{z} - \vec{z}_i)^T \Sigma^{-1} (\vec{z} - \vec{z}_i)}

where \vec{z}_i is the i-th center, and \Sigma is the covariance matrix.

Is this form necessary as prior knowledge?

In practice, you can use other bases too.

How is the noise handled?

Here we use white noise; given enough data, we can learn it too.

The change of the number density n(\vec{z},t) with diffusion constant \bold{D} is given by:

\frac{\partial n(\vec{z},t)}{\partial t} = \nabla \cdot [- \bold{J}(\vec{z}) n(\vec{z},t) + \bold{D} \nabla n(\vec{z},t)] + \int d\vec{z}' \bold{K}(\vec{z},\vec{z}') n(\vec{z}',t)

where \bold{J}(\vec{z}) is the drift velocity, \bold{D} is the diffusion constant, and \bold{K}(\vec{z},\vec{z}') is the transition rate.

This is a Fokker-Planck equation (with an additional transition kernel).

A specific ansatz for the dynamical equation is:

\frac{d\bold{x}}{dt} = f(\bold{x}) + (\bold{\sigma}\bold{x}) \bold{\eta}(t)\\ \frac{\partial n(\bold{x},t)}{\partial t} = -\nabla \cdot (n\bold{f}) + \nabla \cdot (\bold{D} \nabla n) + \mu(\bold{x}) n

Have you considered the anisotropy (各向异性) of the cell?

No, we work in the phase space of the cell.

Human population behavior and propagation dynamics

Qingyan Hu, Southern University of Science and Technology (SUSTech), Center for Complex Flows and Soft Matter Research, Department of Statistics and Data Science

information spreading dynamics

Information spreading dynamics contain three key elements:

We model the information spreading dynamics as:

\beta_i = \alpha_i x (1-\gamma)^{f_i(x)}

where \beta_i is the probability of user i spreading the information, \alpha_i is the information feature of user i, x is the fraction of users who have received the information, \gamma encodes the network structure, and f_i(x) is the user attribute of user i.

f_i(x) = \frac{1}{1+e^{-\lambda_i(x-\theta_i)}}

where \lambda_i and \theta_i are parameters that depend on the user's attributes, such as age, education, income, etc. This logistic function models the user's cognitive psychology: the user's probability of receiving the information depends on how much the information matches their prior knowledge and beliefs.

Classical theory: social reinforcement

\beta_i = \alpha_i x (1-\gamma)^{f_i(x)} + \lambda \sum_{j \in \mathcal{N}_i} \beta_j

where \mathcal{N}_i is the set of neighbors of user i, and \lambda is the social reinforcement parameter.

Has your theory considered the inhomogeneity of different users?

No, \alpha_i is the same for all users, but it could be made different.

Here we use

f_i(x) = \frac{\ln {[\beta_i(x)/\beta_i(1)]} - \ln x}{\ln{(1-\gamma)}} + 1

where \beta_i(x) is the probability of user i spreading the information when the fraction of users who have received it is x.

We simplify the model as:

f_i(x)=x^{\omega_i}

where \omega_i is the user attribute of user i.

Disease spreading dynamics

The SIR model is a compartmental model that describes the dynamics of disease spreading in a population. It is a system of ordinary differential equations that models the number of susceptible, infected, and recovered individuals in a population over time.

\frac{dS}{dt} = -\beta I S\\ \frac{dI}{dt} = \beta I S - \gamma I\\ \frac{dR}{dt} = \gamma I

where S is the number of susceptible individuals, I is the number of infected individuals, R is the number of recovered individuals, \beta is the infection rate, and \gamma is the recovery rate.
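A forward-Euler sketch of these equations; the rates, step size, and initial condition are illustrative, with S, I, R taken as fractions of the population:

```python
import numpy as np

beta, gamma, dt = 0.3, 0.1, 0.1   # infection rate, recovery rate, time step
S, I, R = 0.99, 0.01, 0.0         # initial fractions (sum to 1)
for _ in range(int(200 / dt)):
    dS = -beta * I * S
    dI = beta * I * S - gamma * I
    dR = gamma * I
    S, I, R = S + dt * dS, I + dt * dI, R + dt * dR
```

Since dS + dI + dR = 0 at every step, the total population fraction is conserved exactly; with these rates the basic reproduction number is beta/gamma = 3, so most of the population ends up in R.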

In COVID, the quarantine delay \tau is related to R by

R(\tau) \approx R_0 \frac{\tau - \bar{T}_e}{\bar{Q}\bar{T}_c}

where R_0 is the basic reproduction number, \bar{T}_e is the average incubation period, \bar{Q} is the average quarantine, and \bar{T}_c is the average recovery time.

The critical quarantine delay \tau_c is related to R_0, given infinite average degree:

\tau_c \approx \bar{T}_e + \frac{\bar{Q}\bar{T}_c}{R_0}

Have you considered the spatial factor in disease spreading?

No, but we could.

Spiking Neural Networks

Integrate-and-fire model:

C \frac{dV}{dt} = -g_L(V-E_L) + I(t)

where C is the membrane capacitance, V is the membrane potential, g_L is the leak conductance, E_L is the leak reversal potential, and I(t) is the input current.

The membrane potential V is reset to V_r when it reaches the threshold V_\text{th}:

V \to V_r \text{ if } V \ge V_\text{th}
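A minimal simulation of this model under a constant input current; all parameter values below are illustrative (arbitrary units):

```python
# Leaky integrate-and-fire neuron, forward Euler.
C, g_L, E_L = 1.0, 0.1, -65.0      # capacitance, leak conductance, leak reversal
V_th, V_r = -50.0, -65.0           # spike threshold and reset potential
dt, I = 0.1, 2.0                   # time step and constant input current

V, spikes = E_L, 0
for _ in range(int(100 / dt)):
    V += dt / C * (-g_L * (V - E_L) + I)   # membrane integration
    if V >= V_th:
        V, spikes = V_r, spikes + 1        # fire and reset
```

Because the steady-state voltage E_L + I/g_L = -45 lies above the threshold, the neuron fires regularly; lowering I below g_L * (V_th - E_L) silences it.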

Spiking neural networks claim to have low power usage and to be more robust to noise.

The future of computation: Neuromorphic computing

Training SNNs is hard because the spike is not differentiable, so backpropagation cannot be used directly.

Mimicking brain activity to learn is enough; we don't need to know how the brain works.

Training methods:

| Method | Pros | Cons |
| --- | --- | --- |
| ANN-SNN conversion | Easy to implement | Low accuracy |
| Approximate gradient | High accuracy | High complexity |
| Neuro-synaptic dynamics | High accuracy | Low performance |

They apply STDP to train the network.

STDP is a biological process that adjusts the strength of connections between neurons in the brain. The process adjusts the connection strengths based on the relative timing of the input signals to pairs of neurons. The STDP process partially explains the activity-dependent development of nervous systems, especially with regard to long-term potentiation and synaptic plasticity.

\Delta w_{ij} = \begin{cases} A_+ e^{-\frac{\Delta t}{\tau_+}} & \text{if } \Delta t > 0\\ -A_- e^{\frac{\Delta t}{\tau_-}} & \text{if } \Delta t < 0 \end{cases}

where \Delta w_{ij} is the change in the synaptic weight between the i-th and j-th neurons, A_+ is the amplitude of the positive change, A_- is the amplitude of the negative change, \tau_+ is the time constant of the positive change, \tau_- is the time constant of the negative change, and \Delta t is the time difference between the spikes of the i-th and j-th neurons.
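The STDP window above written as a function; the amplitudes and time constants below are illustrative:

```python
import numpy as np

def stdp(dt, A_plus=0.01, A_minus=0.012, tau_plus=20.0, tau_minus=20.0):
    """Weight change for spike-time difference dt = t_post - t_pre."""
    if dt > 0:       # pre fires before post: potentiation
        return A_plus * np.exp(-dt / tau_plus)
    elif dt < 0:     # post fires before pre: depression
        return -A_minus * np.exp(dt / tau_minus)
    return 0.0
```

Summing stdp(t_post - t_pre) over all spike pairs of a connected neuron pair gives the total weight update for one training window.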

They proposed a new method to train SNN called DA-STDP, which is based on the dopamine reward signal.

DA-STDP is based on the following equation:

\Delta w_{ij} = \begin{cases} A_+ e^{-\frac{\Delta t}{\tau_+}} & \text{if } \Delta t > 0\\ -A_- e^{\frac{\Delta t}{\tau_-}} & \text{if } \Delta t < 0 \end{cases} + \begin{cases} \alpha & \text{if } \Delta t > 0\\ -\beta & \text{if } \Delta t < 0 \end{cases}

where α\alpha is the reward for the positive change, and β\beta is the reward for the negative change.

The SNN is implemented in hardware.

Probabilistic inference reformulated as tensor networks

Jin-Guo Liu, HKUST

Reformulate probabilistic inference as tensor networks

Probabilistic inference is the task of computing the posterior distribution of the hidden variables given the observed variables.

p(\vec{h}|\vec{v}) = \frac{p(\vec{h},\vec{v})}{p(\vec{v})}

where \vec{h} are the hidden variables, and \vec{v} are the observed variables.

The joint distribution of the hidden variables and the observed variables is given by:

p(\vec{h},\vec{v}) = \frac{1}{Z} \prod_{i=1}^n \phi_i(\vec{h},\vec{v})

where Z is the partition function, and \phi_i(\vec{h},\vec{v}) is a potential function.

Why is the joint distribution like this?

This is the definition of the joint distribution (as a product of factors).

Graphical probabilistic model: the joint distribution is given by:

p(\vec{h},\vec{v}) = \frac{1}{Z} \prod_{i=1}^n \phi_i(\vec{h}_i,\vec{v}_i) \prod_{j=1}^m \phi_j(\vec{h}_j)

where \vec{h}_i and \vec{v}_i are the hidden and observed variables of the i-th factor, \phi_i(\vec{h}_i,\vec{v}_i) and \phi_j(\vec{h}_j) are the potential functions, and Z is the partition function.

Where is the graph here?

The graph is the structure of the joint distribution.

Probabilistic Model:

<!-- See TensorInference.jl -->

Variational Autoregressive Network

Pan Zhang , ITP, 2023-07-28 14:03:12

Variational methods in statistical mechanics

Mean field assumptions:

q(\bold{s})= \prod_{i=1}^n q_i(s_i)

where q(\bold{s}) is the variational distribution, and q_i(s_i) is the variational distribution of the i-th spin.

Bethe ansatz

q(\bold{s})= \prod_{i=1}^n q_i(s_i) \prod_{i<j}^n \frac{q_{ij}(s_i,s_j)}{q_i(s_i)q_j(s_j)}

where q_{ij}(s_i,s_j) is the joint variational distribution of the i-th and j-th spins.

Chemical reaction simulation with VAN

See lecture note of Yin Tang.

Machine learning and chaos

Question:

Given

\dot{x}= f(x,r)

where f is nonlinear:

f(a+b)\ne f(a)+ f(b)

Predict the final x given the initial x_0.

A simplest example is the logistic map

x_{n+1} = rx_n(1-x_n)

For stability, r must be restricted to [0,4]; we find that for some r, the period of the system T\to \infty.

Simple code to replicate it:

```python
import numpy as np
import matplotlib.pyplot as plt

def logistic(a):
    """Iterate x_{n+1} = a*x*(1-x); keep the last 100 points (the attractor)."""
    x = [0.3]
    for _ in range(400):
        x.append(a * x[-1] * (1 - x[-1]))
    return x[-100:]

# Bifurcation diagram: plot the attractor for each value of a.
for a in np.linspace(2.0, 4.0, 1000):
    x = logistic(a)
    plt.plot([a] * len(x), x, "c.", markersize=0.1)

plt.xlabel("r")
plt.ylabel("x_f")
plt.show()
```


Another example is the Lorenz system:

\frac{dx}{dt} = \sigma (y - x)\\ \frac{dy}{dt} = x (\rho - z) - y\\ \frac{dz}{dt} = xy - \beta z

where x, y, and z are the state variables, \sigma, \rho, and \beta are the parameters, and t is the time.

Fractional ordinary differential equation (FODE):

\frac{d^\alpha x}{dt^\alpha} = f(x,t)

where \alpha is the order of the fractional derivative, x is the state variable, t is the time, and f(x,t) is a function of the state variable and the time.

For example, the fractional derivative of order \alpha of the function x(t) is given by:

\frac{d^\alpha x}{dt^\alpha} = \frac{1}{\Gamma(1-\alpha)} \frac{d}{dt} \int_0^t \frac{x(t')}{(t-t')^\alpha} dt'

where \Gamma is the gamma function.

What is the physical meaning of the fractional derivative?

It encodes the memory of the system: the fractional derivative weights the whole past history of x(t) with a power-law kernel controlled by \alpha.

Feature of chaos

This is the property that the system is sensitive to initial conditions.

Mathematical definition (sensitive dependence on initial conditions):

\exists \epsilon > 0, \forall x \in \mathcal{O}, \forall \delta > 0, \exists y \in \mathcal{O} \text{ with } \|x - y\| < \delta, \exists n \in \mathbb{N}, \text{ such that } \|f^n(x) - f^n(y)\| > \epsilon

Lyapunov exponent:

\lambda = \lim_{t\to\infty} \frac{1}{t} \log \Big|\frac{dx(t)}{dx(0)}\Big|

where \lambda is the Lyapunov exponent, x(t) is the state variable at time t, and x(0) is the state variable at time 0.

Example: the logistic map with r=4 and x_0=0.2 has \lambda = \ln 2 \approx 0.69.

If \lambda > 0, the system is chaotic.
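A numeric check of this value: for a map, \lambda is the orbit average of \log|f'(x_n)|, and for the logistic map f'(x) = r(1-2x). The burn-in length and orbit length below are illustrative:

```python
import numpy as np

def lyapunov(r, x0=0.2, n=10000, burn=1000):
    """Average log|f'(x)| along the logistic-map orbit after a burn-in."""
    x, acc = x0, 0.0
    for i in range(n + burn):
        if i >= burn:
            # tiny epsilon guards log(0) if the orbit ever hits x = 0.5
            acc += np.log(abs(r * (1 - 2 * x)) + 1e-12)
        x = r * x * (1 - x)
    return acc / n

lam = lyapunov(4.0)   # theory: ln 2
```

A positive result confirms chaos at r = 4; running the same estimator at r = 3.2 (a period-2 window) gives a negative exponent instead.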

This is the property that the system will eventually visit any region of the phase space (topological transitivity).

Mathematical definition:

\forall U, V \subset \mathcal{O} \text{ open and non-empty}, \exists n \in \mathbb{N}, f^n(U) \cap V \ne \emptyset

where \mathcal{O} is the phase space, U and V are open sets in the phase space, f^n(U) is the n-th iteration of the set U, and \emptyset is the empty set.

This is the property that the periodic orbits of the system are dense in the phase space.

Mathematical definition:

\forall x \in \mathcal{O}, \forall \epsilon > 0, \exists y \in \mathcal{O} \text{ periodic (i.e. } f^n(y) = y \text{ for some } n \in \mathbb{N}), \text{ such that } \|x - y\| < \epsilon

where x is a point in the phase space, \epsilon is a positive real number, y is a periodic point, and \|x - y\| is the distance between x and y.

How is ML doing better than traditional methods?

Two questions in chaos study, under the condition that the system dynamics is unknown (model-free):

Essence of statistical learning: the i.i.d. assumption


One method: apply reservoir computing.

Producing predictions for an unknown non-linear system is in principle difficult. Here we review some methods to predict chaotic systems based on data/models, trying to grasp the essence of chaos.

Definition and Characterization of Chaos

Question:

Given

$$\dot{x}= f(x,r)$$

where

$$f(a+b)\ne f(a)+ f(b)$$

Predict the final $x$ given the initial $x_0$.

The simplest example is the logistic map:

$$x_{n+1} = rx_n(1-x_n)$$

For the iterates to stay bounded in $[0,1]$, $r$ must be restricted to $[0,4]$; we find that for some $r$, the period of the system diverges, $T\to \infty$.

Simple code to replicate it

import numpy as np
import matplotlib.pyplot as plt

def logistic(a):
    x = [0.3]
    for i in range(400):
        x.append(a * x[-1] * (1 - x[-1]))
    return x[-100:]  # keep only the last 100 iterates (the attractor)

for a in np.linspace(2.0, 4.0, 1000):
    x = logistic(a)
    plt.plot([a] * len(x), x, "c.", markersize=0.1)

plt.xlabel("r")
plt.ylabel("x_f")
plt.show()

Image

Another example is the Lorenz system:

$$\frac{dx}{dt} = \sigma (y - x),\qquad \frac{dy}{dt} = x (\rho - z) - y,\qquad \frac{dz}{dt} = xy - \beta z$$

where $x$, $y$, and $z$ are the state variables, $\sigma$, $\rho$, and $\beta$ are the parameters, and $t$ is the time. The Lorenz system has chaotic solutions for some parameter values and initial conditions. In particular, for $\sigma = 10$, $\beta = 8/3$, and $\rho = 28$, the Lorenz system exhibits chaotic solutions for many initial conditions.

Image

Feature of chaos

Sensitivity to initial conditions: this is the property that nearby trajectories eventually separate.

Mathematical definition:

$$\exists \epsilon > 0, \forall x \in \mathcal{O}, \forall \delta > 0, \exists y \in \mathcal{O} \text{ with } \|x-y\|<\delta, \exists n \in \mathbb{N}, \text{ such that } \|f^n(x) - f^n(y)\| > \epsilon$$

Lyapunov exponent:

$$\lambda = \lim_{t\to\infty} \frac{1}{t} \log \left|\frac{dx(t)}{dx(0)}\right|$$

where $\lambda$ is the Lyapunov exponent, $x(t)$ is the state variable at time $t$, and $x(0)$ is the state variable at time $0$.

Example: the logistic map with $r=4$ and $x_0=0.2$ has $\lambda = \ln 2 \approx 0.69$.

If $\lambda > 0$, the system is chaotic.
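For a one-dimensional map, $\lambda$ is simply the orbit average of $\log|f'(x)|$; a short sketch for the logistic map (iteration count and transient length are our own choices):

```python
import numpy as np

r, x = 4.0, 0.2
lam_sum, n_samples = 0.0, 10000
for i in range(n_samples + 100):
    if i >= 100:                                   # discard a short transient
        lam_sum += np.log(abs(r * (1 - 2 * x)))    # log|f'(x)|, f'(x) = r(1 - 2x)
    x = r * x * (1 - x)

lam = lam_sum / n_samples                          # approaches ln 2 ≈ 0.693 for r = 4
```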

Topological mixing: this is the property that the system will eventually reach any region of the phase space.

Mathematical definition:

$$\forall U, V \subset \mathcal{O} \text{ open}, \exists N \in \mathbb{N}, \forall n \ge N, f^n(U) \cap V \ne \emptyset$$

where $\mathcal{O}$ is the phase space, $U$ and $V$ are two open sets in the phase space, $f^n(U)$ is the $n$-th iterate of the set $U$, and $\emptyset$ is the empty set.

Dense periodic orbits: this is the property that the system has infinitely many periodic orbits, whose points are dense in the phase space.

Mathematical definition:

$$\forall x \in \mathcal{O}, \forall \epsilon > 0, \exists \text{ periodic } y \in \mathcal{O}, \text{ such that } \|x - y\| < \epsilon$$

where $\mathcal{O}$ is the phase space, $x$ is a point in the phase space, $\epsilon$ is a positive real number, and $y$ is a periodic point of $f$, i.e. $f^n(y)=y$ for some $n \in \mathbb{N}$.

Prediction of Chaos

Two questions in chaos study, under the condition that the system dynamics is unknown(model-free):

Image

Image

Image

Image

A paper in 2001 used Reservoir Computing to predict chaotic evolution, a model-free method. Here we reproduce it.

Problem Formulation

Chaos synchronization: given a pair of coupled chaotic oscillators, we want to synchronize the two oscillators.

Oscillator 1:

$$\frac{dx_1}{dt} = f(x_1) + \epsilon (x_2 - x_1)$$

Oscillator 2:

$$\frac{dx_2}{dt} = f(x_2) + \epsilon (x_1 - x_2)$$

where $x_1$ and $x_2$ are the state variables of oscillators 1 and 2, $f$ is the (identical) dynamics of each oscillator, and $\epsilon$ is the coupling strength. This coupling is called linear coupling, or diffusive coupling, because the coupling term is proportional to the difference between the two oscillators.
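A quick simulation of this setup (our own sketch: two diffusively coupled Lorenz oscillators, plain Euler steps, and a coupling $\epsilon=50$ chosen well above the critical value; none of these choices are from the lecture):

```python
import numpy as np

def lorenz(s):
    x, y, z = s
    return np.array([10.0 * (y - x), x * (28.0 - z) - y, x * y - (8.0 / 3.0) * z])

def run_pair(eps, t_end=20.0, dt=0.001):
    """Integrate two diffusively coupled Lorenz oscillators with Euler steps."""
    s1 = np.array([1.0, 1.0, 1.0])
    s2 = np.array([2.0, 1.0, 1.0])          # different initial condition
    diffs = []
    for _ in range(int(t_end / dt)):
        d1 = lorenz(s1) + eps * (s2 - s1)
        d2 = lorenz(s2) + eps * (s1 - s2)
        s1, s2 = s1 + dt * d1, s2 + dt * d2
        diffs.append(np.linalg.norm(s1 - s2))
    return np.array(diffs)

d_uncoupled = run_pair(eps=0.0)    # stays of order the attractor size
d_coupled = run_pair(eps=50.0)     # |x1 - x2| decays to zero: complete synchronization
```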

How is this related to diffusion?

The coupling term is proportional to the difference between the two oscillators, which is similar to the diffusion term in the diffusion equation: $\frac{\partial u}{\partial t} = D \frac{\partial^2 u}{\partial x^2}$.

Complete synchronization:

$$(x_1,y_1,z_1) = (x_2,y_2,z_2), \quad \epsilon > \epsilon_c$$

which means that the two oscillators have identical state variables when the coupling strength exceeds the critical coupling strength $\epsilon_c$.

Phase synchronization:

$$\theta_1 = \theta_2 + c, \quad \epsilon > \epsilon_p$$

where $\theta_1$ and $\theta_2$ are the phases of the two oscillators, $c$ is a constant, and $\epsilon_p$ is the corresponding critical coupling strength. The two oscillators lock their phases (up to a constant offset) when the coupling strength exceeds $\epsilon_p$.

How is the phase $\theta$ defined?

$\theta$ is the angle of the oscillator in the phase space. For example, for the logistic map, $\theta$ is the angle of the point $(x_n, x_{n+1})$ in the phase space.

Generalized synchronization:

$$x_1 = G(x_2), \quad \epsilon > \epsilon_g$$

where $G$ is some function and $\epsilon_g$ is the corresponding critical coupling strength. The state of one oscillator becomes a function of the other's state when the coupling strength exceeds $\epsilon_g$.

Science is good, but engineering will make it great.

Why is chaos synchronization important?

Real systems are coupled, and chaos synchronization is a way to model the coupling.

Can we predict the synchronization point $\epsilon_c$ given some $x_{1,2}(\epsilon_i,t)$, $i\in\{1,2,3\}$?

Generate Data

One data pair looks like this:

Image

Image

We can visualize the data by plotting the observations of the first system against the observations of the second system for each epsilon value.

Image This displays the dynamic coupling between the two systems. This can be seen as a visualization of the phase space of the coupled system.

Another visualization looks at the final-state difference $|x_1-x_2|$ of the system for each epsilon value.

Image

which indicates a larger bias when $\epsilon$ is large.

Weird, right? We expect the system to synchronize!

To better understand the synchronization, we plot the correlation coefficient between $x_1$ and $x_2$.

Image

This shows that the synchronization is not perfect, but the correlation approaches $-1$ when $\epsilon$ is large.

Anyway, we can use the data to train a model to predict the synchronization point ϵc\epsilon_c.

Create a Reservoir

A reservoir can be abstractly thought of as a dynamical system $x_{n+1} = f(x_n)$, where $x_n$ is the state vector at time $n$ and $f$ is the reservoir function. By supervised training of a linear readout, we can fit the reservoir's output to a given data set.
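Before turning to the package used below, here is a minimal, dependency-free echo state network in plain numpy (a sketch; the reservoir size, leak rate, spectral radius, teacher signal, and ridge strength are all our own choices): a random leaky-tanh reservoir is driven by a sine wave, and only the linear readout is trained to predict the next sample.

```python
import numpy as np

rng = np.random.default_rng(42)
N, leak, rho = 200, 0.5, 0.9

Win = rng.uniform(-0.5, 0.5, N)               # input weights (fixed, random)
W = rng.uniform(-0.5, 0.5, (N, N))            # recurrent weights (fixed, random)
W *= rho / np.max(np.abs(np.linalg.eigvals(W)))   # rescale to spectral radius rho

u = np.sin(0.1 * np.arange(3000))             # teacher signal
x = np.zeros(N)
states = np.zeros((len(u) - 1, N))
for k in range(len(u) - 1):
    x = (1 - leak) * x + leak * np.tanh(Win * u[k] + W @ x)
    states[k] = x                             # reservoir state after seeing u[k]

washout = 100                                 # discard the initial transient
X, y = states[washout:], u[washout + 1:]      # target: the NEXT sample
Wout = np.linalg.solve(X.T @ X + 1e-8 * np.eye(N), X.T @ y)   # ridge readout
mse = np.mean((X @ Wout - y) ** 2)            # one-step prediction error
```

Only `Wout` is trained; the random reservoir is left untouched, which is the defining trick of reservoir computing.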

The community has created a package, reservoirpy, to create reservoirs and apply them to dynamical-system prediction.

Image

Here we review one of its notebooks to see how it works.

A (very) short tutorial on Reservoir Computing

Image

ESN: Echo State Network, a type of reservoir computing.

Image

Now we practice building a reservoir.

from reservoirpy.nodes import Reservoir, Ridge

reservoir = Reservoir(100, lr=0.5, sr=0.9)
ridge = Ridge(ridge=1e-7)

esn_model = reservoir >> ridge

This cute `>>` is a pipeline operator that chains the reservoir node into the ridge readout, producing a single ESN model. It's syntactic sugar for composing the two nodes into one model.

Let's first train it on the Lorenz system.

Image

Visualize the prediction separately.

Image

This is the prediction of $x_t$ given $x_{t-1}$.

To see if we can predict the long-term evolution, we can predict $x_{t+N}$ given $x_t$.

RC-ESN is SOTA in short-term weather prediction compared to ANN and RNN-LSTM; that's why it's actively researched.

Yet recent weather-forecasting methods are based on 3D-CNN, RNN-LSTM, and other models rather than RC-ESN.

AI for Physics

Jie Ren, TJU, 2023-07-31 09:11:25

Why physics in AI?

Symmetry, geometry, and conservation laws are the key to understand the world.

Physics-inspired: Inspired from bird is OK, but understand hydrodynamics is the key.

Neural networks are already very good, but we should aim to go beyond merely imitating the human brain.

Hopfield is still alive and well.

AI done on physical system is the future of computing

AI algorithms are possibly inspired by physics: statistical machine learning inspired by wave and diffusion systems.

Wave System

Protein folding is an NP-hard problem.

Yet nature is the most performant computer.

Well, that's why analog computing was proposed, and then delayed due to its inaccuracy.

Is today the best time to bring back analog computing?

Matrix Optics:

A wave can be transformed by an optical setup, as described by the Fresnel-Kirchhoff diffraction integral:

$$U(x,y) = \frac{1}{i\lambda} \int_{-\infty}^\infty \int_{-\infty}^\infty U(x',y') \frac{e^{ikr}}{r}\, dx'\, dy'$$

where $U(x,y)$ is the wave function, $\lambda$ is the wavelength, $k$ is the wave number, and $r$ is the distance between the point $(x,y)$ and the point $(x',y')$.

Not only the amplitude, but also the phase of the wave can be transformed by an optical setup.

Will neural networks be better when extended to the complex plane?

Optics is suitable for spiking neural networks using phase transitions and circular-harmonic oscillators.

How a recurrent NN can be mapped to a wave equation

Given a wave distribution $u(x,y)$, we have the equation:

$$\frac{\partial^2 u}{\partial t^2} - c^2 \nabla^2 u = f$$

where $u$ is the wave distribution, $t$ is the time, $c$ is the wave speed, $\nabla^2$ is the Laplace operator, and $f$ is the source term.

Do a simple discretization in time:

$$\frac{u_{t+1} - 2u_t + u_{t-1}}{\Delta t^2} - c^2 \nabla^2 u_t = f_t$$

Solving for $u_{t+1}$, we can rewrite it in matrix form:

$$\begin{bmatrix} u_{t+1}\\ u_t \end{bmatrix} = \begin{bmatrix} 2I + c^2 \Delta t^2 \nabla^2 & -I\\ I & 0 \end{bmatrix} \begin{bmatrix} u_t\\ u_{t-1} \end{bmatrix} + \Delta t^2\begin{bmatrix} f_t\\ 0 \end{bmatrix}$$

where $u_t$ is the (spatially discretized) wave field at time step $t$, $I$ is the identity matrix, $\Delta t$ is the time step, and $\nabla^2$ is the discretized Laplace operator.

We can define the hidden state $h_t = \begin{bmatrix} u_t\\ u_{t-1}\end{bmatrix}$, the weight matrix $W = \begin{bmatrix} 2I + c^2 \Delta t^2 \nabla^2 & -I\\ I & 0\end{bmatrix}$, and the input $x_t = \begin{bmatrix} f_t\\ 0\end{bmatrix}$; then we have

$$h_{t+1} = W h_t + \Delta t^2 x_t,\qquad y_{t+1} = [P^{(0)} h_{t+1}]^2$$

where $y_{t+1}$ is the output at time $t+1$ and $P^{(0)}$ is a projection operator (the square models an intensity measurement of the projected field).
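To check this mapping numerically, here is a small sketch (our own construction, not from the lecture: a 1-D chain with Dirichlet boundaries, a Gaussian initial bump, and no source) comparing the RNN-style recursion $h_{t+1}=Wh_t+\Delta t^2 x_t$ against the direct finite-difference update of the same scheme:

```python
import numpy as np

n, c, dt, dx = 50, 1.0, 0.01, 0.1
# discrete 1-D Laplacian with Dirichlet boundaries
lap = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
       + np.diag(np.ones(n - 1), -1)) / dx**2
I = np.eye(n)

# RNN weight matrix of the discretized wave equation
W = np.block([[2 * I + c**2 * dt**2 * lap, -I],
              [I, np.zeros((n, n))]])

u0 = np.exp(-((np.arange(n) - 25) * dx) ** 2 / 0.05)   # Gaussian bump
f = np.zeros(n)                                        # no source term
h = np.concatenate([u0, u0])                           # h_0 = [u_0, u_{-1}], zero initial velocity
for _ in range(200):
    h = W @ h + dt**2 * np.concatenate([f, np.zeros(n)])
u_rnn = h[:n]

# cross-check: the plain finite-difference update of the same scheme
v_prev, v_now = u0.copy(), u0.copy()
for _ in range(200):
    v_prev, v_now = v_now, (2 * v_now - v_prev
                            + c**2 * dt**2 * (lap @ v_now) + dt**2 * f)
u_fd = v_now
```

Both loops implement the same recursion, so the two final fields agree; the "RNN" view just packages the two-step time stepping into one linear hidden-state update.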

How is the training done?

No backpropagation needed. Pseudo-inverse is used to train the network.

Pseudo-inverse is defined as:

$$A^+ = (A^T A)^{-1} A^T$$

where $A^+$ is the pseudo-inverse of $A$ and $A^T$ is the transpose of $A$.

It's called a pseudo-inverse because it is not a true inverse; the formula above is only defined when $A^T A$ is invertible (the Moore-Penrose pseudo-inverse extends it to the general case).

To train an optical neural network, we need to solve the following equation:

To train an optical neural network, we need to solve the following equation:

$$Y=XW$$

The solution using the pseudo-inverse is:

$$W = X^+ Y = (X^T X)^{-1} X^T Y$$

This training is linear only, due to the difficulty of replicating non-linear activations optically.
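A quick numerical sketch of this training rule (synthetic data with sizes of our own choosing): generate targets from a known linear map and recover it with the pseudo-inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                 # 500 samples of 20 "neuron" activations
W_true = rng.normal(size=(20, 5))
Y = X @ W_true                                 # linear targets Y = X W

# pseudo-inverse solution W = (X^T X)^{-1} X^T Y
W_fit = np.linalg.inv(X.T @ X) @ X.T @ Y
```

`np.linalg.pinv(X) @ Y` gives the same answer and also handles the case where $X^TX$ is singular.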

How quantum grover algorithm can be mapped to ML

Grover's algorithm is a quantum algorithm that finds a specific element in an unsorted list with high probability.

To find the element $x$ in the list $L$, we need to find the index $i$ such that $L[i] = x$.

Steps of Grover algorithm:

Why this works, an intuitive explanation?

The oracle operator $O$ flips the sign of the basis state $\ket{i}$ if $L[i] = x$, and does nothing if $L[i] \ne x$. The diffusion operator $D$ reflects the state about the uniform superposition, i.e. about the average amplitude.

So together, the oracle operator $O$ and the diffusion operator $D$ amplify the amplitude of the state $\ket{i}$ with $L[i] = x$ and reduce the amplitudes of the states with $L[i] \ne x$. After $k \approx \frac{\pi}{4}\sqrt{N}$ iterations, the marked state has a much higher amplitude than the rest, so measuring the register yields the index $i$ with high probability.
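A classical state-vector simulation of these two operators (a sketch with $N=16$ list entries and a marked index chosen by us; the diffusion operator is the reflection $2\ket{s}\bra{s}-I$ about the uniform superposition $\ket{s}$):

```python
import numpy as np

n_items, target = 16, 3
psi = np.full(n_items, 1 / np.sqrt(n_items))       # uniform superposition

oracle = np.eye(n_items)
oracle[target, target] = -1                        # flip the sign of the marked state

s = np.full(n_items, 1 / np.sqrt(n_items))
diffusion = 2 * np.outer(s, s) - np.eye(n_items)   # reflection about the mean

k = int(round(np.pi / 4 * np.sqrt(n_items)))       # optimal iteration count (3 here)
for _ in range(k):
    psi = diffusion @ (oracle @ psi)

p_target = psi[target] ** 2                        # probability of measuring the marked index
```

After the optimal number of iterations the marked index is measured with probability $\sin^2((2k+1)\arcsin(1/\sqrt{N})) \approx 0.96$ for $N=16$.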

Real intelligence is creation.

The intelligence needed to watch a video ≪ the intelligence needed to create a video.

Implementation of Grover algorithm using optical neural network:

Using surface-wave materials to implement the oracle operator $O$ and the diffusion operator $D$?

Diffusion System

Dimension reduction is the key to understanding the world. It's the essence of understanding.

Two ingredients of learning:

Manifold learning

<!-- To be Continued -->

Diffusion mapping

Given data $\{x_n\}^N_{n=1}$, $x_n\in \mathbb{R}^p$, we define the affinity between $x_i$ and $x_j$ as:

$$A_{i,j} = \exp \Big(-\frac{\|x_i - x_j\|^2}{\epsilon}\Big)$$

where $x_i$ and $x_j$ are the $i$-th and $j$-th data points and $\epsilon$ is a scale parameter. (Note $A_{i,j}$ is a similarity, not a distance: it is large for nearby points.)

Gaussian kernel is used here, because it is a smooth function.

We define the diagonal matrix $D$ as:

$$D_{i,i} = \sum_{j=1}^N A_{i,j}$$

We define the probability matrix $P$ as:

$$P = D^{-1} A$$

Then the Laplacian matrix $L$ is defined as:

$$L = I - P$$

where $I$ is the identity matrix. The equation of motion can be written as:

$$\frac{\partial }{\partial t} \mathbf{p}(t) = -L\, \mathbf{p}(t)$$

which is a diffusion equation, where $\mathbf{p}(t)$ is the probability distribution at time $t$.
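A minimal diffusion-map sketch on toy data (two Gaussian blobs; the sizes and $\epsilon$ are our own choices), following the $A$, $D$, $P=D^{-1}A$ construction above:

```python
import numpy as np

rng = np.random.default_rng(1)
# toy data: two well-separated Gaussian blobs in R^2
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])

eps = 1.0
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
A = np.exp(-d2 / eps)                                 # Gaussian kernel
D = np.diag(A.sum(axis=1))
P = np.linalg.inv(D) @ A                              # row-stochastic Markov matrix

vals, vecs = np.linalg.eig(P)
order = np.argsort(-vals.real)
vals = vals.real[order]
vecs = vecs.real[:, order]
v2 = vecs[:, 1]     # first non-trivial eigenvector: the diffusion coordinate
```

The top eigenvalue is exactly 1 (with a constant eigenvector), and the second eigenvector takes opposite signs on the two blobs, so a single diffusion coordinate already separates the clusters.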

What is the difference between row and column normalization?

They represent different physical systems. Row normalization gives diffusion, and column normalization gives heat conduction. What makes them different?

Google PageRank is essentially this idea, but it developed a faster algorithm. (Well, are left and right multiplication different?)

1.1 Network topology

Let $\mathcal{W}$ be the set of all web pages, each of which can be reached from the root page through a finite number of link hops.

$n=\mathrm{Card}(\mathcal{W})$ is the number of pages. For Google this number changes with time; by June 2004, $n$ already exceeded 4 billion.

Define $G$ as the $n\times n$ connectivity matrix, with $g_{ij}=1$ if page $i$ has a link pointing to page $j$, and $g_{ij}=0$ otherwise.

It describes the topology of the World Wide Web: a very large, but very sparse, matrix.

In fact, the number of non-zero entries of the matrix is exactly the number of hyperlinks on the web.

1.2 The Markov transition matrix

A user on a page may follow one of its links (with probability $p$), or jump at random to any page; this yields the transition probability matrix.

First define $r_i=\sum_j g_{ij}$, the number of pages $j$ that page $i$ links out to.

Then the transition probability matrix $W$ is given by

$w_{ij}=p\,g_{ij}/r_{i}+(1-p)/n$, if $r_{i}\ne0$, i.e. page $i$ contains hyperlinks to other pages;

$w_{ij}=1/n$, if $r_{i}=0$.

Note that $W$ is the connectivity matrix rescaled by its row sums, giving the probability of jumping from one page to another. In practice one usually takes $p=0.85$; considering $n=4\times 10^9$, every entry here is very small.

We thus obtain a Markov transition probability matrix whose entries lie strictly between 0 and 1, and whose rows each sum to exactly 1 (the surfer necessarily leaves the current page).

1.3 Ranking the pages

Computing the order of the pages amounts, simply put, to computing the stationary distribution and scoring the pages by probability, i.e. solving the equation

$$Ax=x$$

where $x$ satisfies $\sum x_i=1$.

In actual computation, one feasible scheme is to start from some state and repeatedly compute $x=Ax$ until the norm of the difference between two successive iterates is smaller than a prescribed value.

2 Powers of a sparse matrix

2.1 Simplifying the computation

Google's actual operation does not involve matrix powers at all, because the effect of $A^k$ can be obtained by updating the hyperlink weights.

The PageRank computation in Matlab exploits the special structure of the Markov matrix. The following approach preserves the sparsity of the matrix: write the matrix as

$$A=pGD+ez^T$$

where the matrix $D$ is defined by $d_{ii}=1/r_i$ if $r_i\ne0$, and $d_{ii}=0$ otherwise;

$e$ is the $n$-dimensional vector with all components equal to 1;

$z$ is the $n$-dimensional vector with components $z_i=(1-p)/n$ if $r_i\ne 0$, and $z_i=1/n$ otherwise.

The equation can then be written as

$$(I-pGD)x=\gamma e$$

where $\gamma = z^T x$.

Although we do not know the value of $\gamma$, once we solve $(I-pGD)x=e$ for $x$, the normalization $\sum x_i=1$ determines the solution directly.

The problem therefore reduces to solving

$$(I-pGD)x=e$$

Google implements $\alpha\approx 0.85$ (the damping factor) to stop the chain from getting trapped. But the problem we are dealing with here requires all $n$ eigenvalues and eigenvectors, and works on a much smaller system.

Anyway, what is left is to compute the final distribution of the chain.

Find the solution of

$$\mathbf{P}x= x$$

Ergodicity? It depends on the connectivity of the graph.

This method yields a new representation of the data, and can define a new distance $\rho$ between the data points.
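A power-iteration sketch of the stationary-distribution computation on a tiny hypothetical 4-page web (the graph, the damping $p=0.85$, and the iteration count are our own choices for illustration):

```python
import numpy as np

# hypothetical 4-page web: G[i, j] = 1 if page i links to page j
G = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

p = 0.85
n = G.shape[0]
r = G.sum(axis=1)                       # out-degrees (all non-zero here)
W = p * G / r[:, None] + (1 - p) / n    # Markov transition matrix, rows sum to 1

x = np.full(n, 1.0 / n)                 # start from the uniform distribution
for _ in range(200):
    x = W.T @ x                         # one step of the random surfer
# x is now (numerically) the stationary distribution: the PageRank scores
```

Page 2, which receives links from three other pages, ends up with the highest score, as expected.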

<!-- Under construction -->

Langevin diffusion

Fokker-Planck equation:

$$\frac{\partial p}{\partial t} = -\nabla \cdot (p \mathbf{v}) + \nabla \cdot (\mathbf{D} \nabla p)$$

where $p$ is the probability distribution, $t$ is the time, $\mathbf{v}$ is the drift velocity, and $\mathbf{D}$ is the diffusion tensor.

Langevin equation:

$$\dot{\mathbf{x}} = - \nabla U(\mathbf{x}) + \mathbf{f} + \sqrt{2D}\, \boldsymbol{\eta}$$

where $\mathbf{x}$ is the position, $U(\mathbf{x})$ is the potential, $\mathbf{f}$ is the force, $D$ is the diffusion coefficient, and $\boldsymbol{\eta}$ is Gaussian white noise.

What if the noise is not Gaussian? The equivalence between the Fokker-Planck equation and the Langevin equation:

Given

$$U(\mathbf{x}) = - \log p(\mathbf{x}), \qquad \dot{\mathbf{x}} = - \nabla U(\mathbf{x}) + \mathbf{f} + \sqrt{2D}\, \boldsymbol{\eta}, \qquad \braket{\eta} = 0, \qquad \braket{\eta_i(t) \eta_j(t')} = \delta_{ij} \delta(t-t')$$

the continuity equation gives

$$\partial_t p(x,t) = - \partial_x \cdot (p\, \dot{\mathbf{x}}) = - \partial_x \cdot \big({-p}\, \nabla U(\mathbf{x}) + p\, \mathbf{f} + \sqrt{2D}\, p\, \boldsymbol{\eta}\big) \equiv \hat{\Omega}\, p$$

Then

$$p(x, t+ \Delta t) = T \exp\Big(\int_{t}^{t + \Delta t}dt'\, \hat{\Omega}(t')\Big) p(x,t) = \Big[1+ \int_{t}^{t+\Delta t} dt'\, \hat{\Omega}(t') + \frac{1}{2} \int_{t}^{t+\Delta t}\! dt' \int_{t}^{t+\Delta t}\! dt''\, T\,\hat{\Omega}(t') \hat{\Omega}(t'') \Big] p(x,t) + O(\Delta t^3)$$

Averaging over the noise (and taking $\mathbf{f}=0$ for brevity), the term linear in $\boldsymbol{\eta}$ vanishes, while the quadratic term produces the diffusion term through $\braket{\eta_i(t')\eta_j(t'')} = \delta_{ij}\delta(t'-t'')$:

$$p(x,t+\Delta t) = p(x,t) + \Delta t\, \partial_x \cdot [(\nabla U)\, p] + D\, \Delta t\, \partial_x^2 p + O(\Delta t^2)$$

As $\Delta t \to 0$, we have

$$\partial_t p = \partial_x \cdot [(\nabla U)\, p] + D\, \partial_x^2 p$$

which is the Fokker-Planck equation with drift $-\nabla U$.
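A quick consistency check between the two pictures (our own toy case: quadratic potential $U(x)=x^2/2$, so the stationary Fokker-Planck density is $\propto e^{-x^2/2D}$ with variance $D$), using Euler-Maruyama integration of the Langevin equation:

```python
import numpy as np

rng = np.random.default_rng(0)
D, dt, n_steps = 0.5, 0.01, 1500
x = np.zeros(20_000)                      # an ensemble of independent walkers

# Euler-Maruyama for dx = -x dt + sqrt(2D) dW  (i.e. U(x) = x^2/2, f = 0)
for _ in range(n_steps):
    x += -x * dt + np.sqrt(2 * D * dt) * rng.normal(size=x.size)

var = x.var()                             # should relax to the stationary variance D
```

After many relaxation times the ensemble variance matches the Fokker-Planck prediction $D$, which is the Langevin/Fokker-Planck equivalence in its simplest form.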

Topological Phonon Hall Effect

The topological phonon Hall effect is the phenomenon that the phonon Hall conductance is quantized in a topological insulator, so that heat can only be conducted in one direction.

Theory:

$$H = \frac{1}{2}p^Tp + \frac{1}{2} u^T(K-A^2)u + u^TAp$$

where $H$ is the Hamiltonian, $p$ is the momentum, $u$ is the displacement, $K$ is the stiffness matrix, and $A$ is the spin-lattice coupling matrix.

What is the physical meaning of the stiffness matrix?

It is the matrix of the spring constants.

We have

$$J = \frac{\hbar}{4V}\sum_{\sigma, \sigma' ,\mathbf{k}} \frac{\omega_{\sigma',\mathbf{k}}}{\omega_{\sigma,\mathbf{k}}}\, \mathbf{a}_{\sigma, \mathbf{k}} \Big(\frac{\partial \Omega^2}{\partial \mathbf{k}} + [A,\Omega^2]\Big)_{\sigma, \sigma'} \mathbf{a}_{\sigma', \mathbf{k}}$$

where $J$ is the phonon Hall conductance, $\hbar$ is the reduced Planck constant, $V$ is the volume, $\sigma$ labels the phonon branch, $\mathbf{k}$ is the wave vector, $\omega_{\sigma,\mathbf{k}}$ is the phonon frequency, $\mathbf{a}_{\sigma, \mathbf{k}}$ is the phonon annihilation operator, $\Omega$ is the Berry connection, and $A$ is the spin-lattice coupling matrix.

The normal velocity $\frac{\partial \Omega^2}{\partial \mathbf{k}}$ is responsible for the longitudinal phonon transport; the Berry-curvature term $[A,\Omega^2]$ is responsible for the transverse phonon transport.

How theory classify different phases?

By the topological invariant, e.g. Chern number.

Problems arise when we deal with amorphous topological phonon structures, because the topological invariant is not well defined there.

Why classification of different phase important?

Recommendation system

In the diffusion model above, we defined

$$P=D^{-1}A$$

but for heat conduction, we define

$$P=AD^{-1}$$

What is the difference between the two?

A recommendation system recommends items to users based on their preferences. If we denote the preference of user $i$ for item $j$ as $r_{ij}$, then we have

$$r_{ij} = \sum_{k=1}^K u_{ik} v_{kj}$$

where $u_{ik}$ is the affinity of user $i$ for feature $k$, $v_{kj}$ is the loading of item $j$ on feature $k$, and $K$ is the number of features.

Consider a book selling situation, here item is book, user is reader, and feature is genre.
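A sketch of this low-rank structure (synthetic ratings built from known factors; the sizes are our own choices): truncated SVD recovers a rank-$K$ model of the rating matrix from the ratings alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, K = 40, 30, 3
U = rng.normal(size=(n_users, K))        # user-feature preferences u_ik
V = rng.normal(size=(K, n_items))        # item-feature loadings v_kj
R = U @ V                                # full rating matrix r_ij

# recover a rank-K model of R via truncated SVD
u, s, vt = np.linalg.svd(R, full_matrices=False)
R_hat = (u[:, :K] * s[:K]) @ vt[:K, :]
err = np.abs(R - R_hat).max()            # exact recovery, since rank(R) = K
```

In practice the rating matrix is mostly missing, so one fits the factors only on observed entries, but the low-rank idea is the same.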

The central challenge in a recommendation system:

After some tedious derivation, our goal is still to find eigenvalues and eigenvectors. One applies finitely many iterations to do this (with truncation).

Machine learning stochastic dynamics

Yin Tang, BNU, 2023-07-31 13:04:59

Dynamical systems can be roughly divided into deterministic and stochastic.

Stochastic dynamics (analytical, on blackboard)

Machine learning discrete-state stochastic dynamics (coding)

Machine learning continuous-state stochastic dynamics

Markovian stochastic dynamics

| State | Discrete | Continuous |
| --- | --- | --- |
| Trajectory view | Recursion equation (e.g. random walk) | Stochastic differential equation (e.g. Ornstein-Uhlenbeck process) |
| Distribution view | Master equation (e.g. random walk, birth-death, chemical master equation) | Fokker-Planck equation (e.g. diffusion equation) |

Brownian motion:

$$\dot{x} = \xi(t)$$

where $\dot{x}$ is the velocity and $\xi(t)$ is Gaussian white noise.

Its statistical properties are:

$$\braket{\xi(t)} = 0, \qquad \braket{\xi(t) \xi(t')} = \delta(t-t')$$

Now we have

$$\braket{x(t)} = \Braket{\int_0^t dt'\, \xi(t')} = 0, \qquad \braket{x(t)^2} = \Braket{\int_0^t dt'\, \xi(t') \int_0^{t} dt''\, \xi(t'')} = \int_0^t dt' \int_0^{t} dt''\, \braket{\xi(t') \xi(t'')} = t$$

Given a random walk on a line with unit steps, equally likely to the left or to the right,

$$\braket{x_t^2} = \braket{(x_{t-1}+\delta x)^2} = \braket{x_{t-1}^2} + \braket{\delta x^2} = \dots = 2Dt$$

The final distribution of the walker is

$$P(x,t)=\frac{1}{\sqrt{4\pi Dt}}\, e^{-\frac{x^2}{4Dt}}$$

where $P(x,t)$ is the probability distribution, $x$ is the position, $t$ is the time, and $D$ is the diffusion coefficient; the Gaussian has variance $2Dt$, consistent with the mean squared displacement above.

Distribution view: random walk

$$P(n, t+\Delta t) = \tfrac{1}{2}P(n-1, t) + \tfrac{1}{2}P(n+1, t)$$

The change in probability over the time interval $\Delta t$ is

$$\Delta P(n, t) = P(n, t+\Delta t) - P(n, t) = \tfrac{1}{2}P(n-1, t) + \tfrac{1}{2}P(n+1, t) - P(n, t)$$

Taking the continuous time-and-space limit and introducing the diffusion constant:

$$\frac{\partial P}{\partial t} = D \frac{\partial^2 P}{\partial x^2}$$

where $P$ is the probability distribution, $t$ is the time, $D$ is the diffusion coefficient, and $x$ is the position.

Can we derive the distribution from this equation?

Solving the diffusion equation with initial condition $P(x,0)=\delta(x)$ gives, for 1-D diffusion,

$$P(x,t) = \frac{1}{\sqrt{4\pi Dt}} e^{-\frac{x^2}{4Dt}}$$

so $\braket{x^2}=2Dt$. One sometimes sees $4Dt$ instead: in $d$ dimensions each direction contributes $2Dt$ independently, so the mean squared displacement is $\braket{r^2}=2dDt$, i.e. $2Dt$ for 1-D diffusion, $4Dt$ for 2-D diffusion, and $6Dt$ for 3-D diffusion.
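A Monte-Carlo check of $\braket{x^2}=2Dt$ (our own sketch: unit steps and unit time intervals, so $D=1/2$; the ensemble size is our choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n_walkers, n_steps = 20_000, 100
steps = rng.choice([-1, 1], size=(n_walkers, n_steps))   # unit steps left or right
x = steps.sum(axis=1)                                    # positions after n_steps

msd = np.mean(x.astype(float) ** 2)   # should approach 2 D t = n_steps for D = 1/2
```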

If you got 7 solutions to a problem, you finally understand it. -- Feynman

A Random Walk Down Wall Street: The Time-Tested Strategy for Successful Investing. -- Burton G. Malkiel

An experiment gives $D=51.1\,\mu\text{m}^2/\text{s}$ for E. coli, from which we can derive, using

$$\braket{x^2} = 2Dt$$

that to walk a distance of around $x=1\text{ cm}$ we need $t\approx 10^6\text{ s}$, which is around $11$ days.

This is too slow.

If we put food to one side, the bacterium acquires a drift velocity $v$, and the equation becomes

$$\frac{\partial P}{\partial t} = D \frac{\partial^2 P}{\partial x^2} - v \frac{\partial P}{\partial x}$$

where $P$ is the probability distribution, $t$ is the time, $D$ is the diffusion coefficient, $x$ is the position, and $v$ is the drift velocity.

Random walk is everywhere, e.g. complex network, search problem, complex system, and anomalous diffusion.

Every stochastic process can be mapped to a random walk.

Seriously?

Chemical master equation

The chemical master equation is a stochastic model for chemical reactions: it keeps the fluctuations, and handles many chemical species beyond a single "walker".

One species, one reaction: $X \xrightarrow{k} \emptyset$, which means that the species $X$ decays to nothing with rate $k$.

Continuous: reaction rate equation

$$\frac{dx}{dt} = -kx + \text{noise}$$

which gives the rate of change of the number of molecules.

Discrete: chemical master equation

$$\partial_t P_t(n) = k(n+1)P_t(n+1) - knP_t(n)$$

where $P_t(n)$ is the probability distribution, $t$ is the time, $n$ is the number of molecules, and $k$ is the reaction rate. This function is intuitively the probability of having $n$ molecules at time $t$; note the gain term from state $n+1$ enters with a positive sign.

The discrete description is more accurate, but it is hard to solve for large $n$.

A solvable example: birth-death process

Define

$$X \in [0,N], \qquad \emptyset \xrightarrow{k_2} X, \qquad X \xrightarrow{k_1} \emptyset$$

We have the equation

$$\frac{dP_t(n)}{dt} = B(n+1)P_t(n+1) + F(n-1)P_t(n-1) - (B(n) + F(n))P_t(n), \qquad F(n) = k_2, \qquad B(n) = k_1 n$$

where $P_t(n)$ is the probability of having $n$ molecules at time $t$, $F(n)=k_2$ is the birth rate (constant rate $k_2$), and $B(n)=k_1 n$ is the death rate (rate constant $k_1$ per molecule).
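A Gillespie-style simulation of this birth-death process (the rates $k_1$, $k_2$ and the run length are our own choices): the stationary distribution is Poisson with mean $k_2/k_1$, which the time average should reproduce.

```python
import numpy as np

rng = np.random.default_rng(0)
k1, k2 = 1.0, 10.0             # death rate constant, birth rate
n, t, t_end = 0, 0.0, 2000.0
burn_in = 50.0
acc_time, acc_n = 0.0, 0.0     # accumulators for the time-weighted average of n

while t < t_end:
    birth, death = k2, k1 * n
    total = birth + death
    dt = rng.exponential(1.0 / total)         # waiting time to the next event
    if t > burn_in:
        acc_time += dt
        acc_n += n * dt
    n += 1 if rng.random() < birth / total else -1   # which event fired
    t += dt

mean_n = acc_n / acc_time       # stationary mean, should be k2 / k1 = 10
```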

How to solve it?

Steady state: $\frac{dP_t(n)}{dt} = 0$, or

$$B(n+1)P_t(n+1) + F(n-1)P_t(n-1) = (B(n) + F(n))P_t(n)$$

We can use a generating function to solve it:

Solving the equation, we have

$$A=\frac{1}{r_2 - r_1} \frac{r_2}{r_2 - k_1}, \qquad B=\frac{1}{r_2 - r_1} \frac{-r_1}{r_2 - k_1}, \qquad r_{1,2} = \frac{k_1 \mp \sqrt{k_1^2 + 4k_2}}{2}$$

This is only for stationary state, how about the time evolution?

Omitted.

The Catalan number $C_n$ is the number of ways to match $n$ pairs of parentheses:

$$C_n = \frac{1}{n+1} \binom{2n}{n}$$

We can derive its recurrence relation using a generating function:
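As a small check (our own illustration), the recurrence $C_0=1$, $C_{n+1}=\sum_{i=0}^{n} C_i C_{n-i}$, which is exactly what the generating-function relation $g(x)=1+x\,g(x)^2$ encodes, reproduces the closed form:

```python
from math import comb

def catalan_via_recurrence(n_max):
    """Catalan numbers from C_{n+1} = sum_i C_i C_{n-i}."""
    C = [1]
    for n in range(n_max):
        C.append(sum(C[i] * C[n - i] for i in range(n + 1)))
    return C

C = catalan_via_recurrence(12)
closed = [comb(2 * n, n) // (n + 1) for n in range(13)]   # C_n = (1/(n+1)) C(2n, n)
```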

The chemical master equation is a mathematical model describing how the probability distribution of molecule numbers in a stochastic chemical reaction system evolves in time. It can be used to study reaction systems with few molecules or strong random fluctuations, for example reactions inside biological cells.

In the continuous case, the reaction rate equation is an ordinary differential equation that treats the molecule number as a continuous variable, expressed as a concentration or mole number. It describes the rate of change of the molecule number, which depends on the rate constant and the molecule number itself:

$$\frac{dx}{dt} = -kx$$

represents a first-order reaction $A \xrightarrow{k} B$, where $x$ is the number of $A$ molecules and $k$ is the rate constant. The meaning of this equation is that $A$ is consumed at a rate proportional to its own number, with proportionality constant $k$. Its solution is

$$x(t) = x(0) e^{-kt}$$

where $x(0)$ is the initial number of $A$ molecules: the number of $A$ molecules decays exponentially in time.

In the discrete case, the chemical master equation treats the molecule number as a discrete variable that can only take integer values. It describes the time evolution of the probability distribution of the molecule number in terms of transition probabilities, i.e. the probabilities that the system jumps from one state to another within a small time interval:

$$\partial_t P_t(n) = k(n+1)P_t(n+1) - knP_t(n)$$

also represents the first-order reaction $A \xrightarrow{k} B$, where $n$ is the number of $A$ molecules and $P_t(n)$ is the probability that the system is in state $n$ at time $t$. The rate of change of the probability of state $n$ equals the difference between the probability flux from state $n+1$ into state $n$ and the flux out of state $n$ into state $n-1$. The flux from $n+1$ to $n$ is $k(n+1)P_t(n+1)$, because each $A$ molecule reacts with probability $k\Delta t$ within $\Delta t$ and there are $n+1$ such molecules; the flux out of state $n$ is $k n P_t(n)$ for the same reason. The equation can be solved with a recursion or a generating function.

To put it diplomatically: times have changed, and people's analytical abilities are not what they used to be.

Solve the equation

$$\partial_t \eta(y,t) = -k_1\, \partial_y \eta$$

We use the method of characteristics, which is based on the idea that the solution of the PDE is constant along characteristic lines, here the lines $y = y_0 + k_1 t$, along which the PDE reduces to an ODE.

An intuitive explanation: the equation simply transports the initial profile along the characteristics, so the value of the solution is carried unchanged along each such line.

# Define the PDE of the form a(x,y)u_x + b(x,y)u_y = c(x,y)u + d(x,y)
# Define the boundary conditions for u
# Define the domain and the grid size
# Initialize the grid and the solution matrix
# Loop over the grid points
  # Check if the point is on the boundary
  # If yes, use the boundary condition
  # If no, use the characteristic line method
    # Find the slope of the characteristic curve using dy/dx = b(x,y)/a(x,y)
    # Find the ODE along the characteristic curve using du/dx = (d(x,y) - c(x,y)u)/a(x,y)
    # Find the previous point on the characteristic curve using backward difference
    # Interpolate the value of u at the previous point using linear interpolation
    # Solve the ODE along the characteristic curve using a numerical method
  # Store the value of u at the current point in the solution matrix
# End loop
# Plot or output the solution
# Import libraries
import numpy as np
import matplotlib.pyplot as plt

# Define the functions a, b, c, d
def a(x,y):
   return 1

def b(x,y):
   return 1

def c(x,y):
   return 0

def d(x,y):
   return 0

# Define the boundary conditions for u
def u_left(y):
   return np.sin(y)

def u_right(y):
   return np.cos(y)

# Define the domain and the grid size
x_min = 0
x_max = 1
y_min = 0
y_max = np.pi
n_x = 11 # number of grid points in x direction
n_y = 11 # number of grid points in y direction
h_x = (x_max - x_min) / (n_x - 1) # grid spacing in x direction
h_y = (y_max - y_min) / (n_y - 1) # grid spacing in y direction

# Initialize the grid and the solution matrix
x = np.linspace(x_min, x_max, n_x)
y = np.linspace(y_min, y_max, n_y)
u = np.zeros((n_x, n_y))

# Define a function that computes the slope of the characteristic curve
def slope(x,y):
   return b(x,y) / a(x,y)

# Define a function that computes the ODE along the characteristic curve
def ode(x,y,u):
   return (d(x,y) - c(x,y) * u) / a(x,y)

# Define a numerical method to solve the ODE (Euler's method)
def euler(x,y,u,h):
   return u + h * ode(x,y,u)

# Loop over the grid points
for i in range(n_x):
   for j in range(n_y):
       # Check if the point is on the left or right boundary
       if i == 0:
           # Use the left boundary condition
           u[i,j] = u_left(y[j])
       elif i == n_x - 1:
           # Use the right boundary condition
           u[i,j] = u_right(y[j])
       else:
           # Use the characteristic line method
           # Find the previous point on the characteristic curve
           x_prev = x[i-1]
           y_prev = y[j] - h_x * slope(x_prev, y[j])
           # Interpolate the value of u at the previous point
           u_prev = np.interp(y_prev, y, u[i-1,:])
           # Solve the ODE along the characteristic curve
           u[i,j] = euler(x_prev, y_prev, u_prev, h_x)

# Plot or output the solution
plt.contourf(x, y, u.T)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Solution of PDE by characteristic line method')
plt.colorbar()
plt.show()

(Figure: contour plot of the numerical solution.) The figure shows the solution of the PDE

\partial_t \eta(y,t) + \partial_y \eta = 0

with the conditions η(y, 0) = sin(y) and η(y, 1) = cos(y), matching u_left and u_right in the code, where x plays the role of t.

Gillespie algorithm

Gillespie algorithm is a stochastic simulation algorithm for chemical master equation. It is a Monte Carlo method that simulates the time evolution of a chemical system.

A pseudo-code of the algorithm:

# Define the system with the initial number of molecules and the reaction rate constants
# Initialize the time to zero
# Loop until the end time or condition is reached
  # Calculate the total propensity function by summing all the reaction propensities
  # Generate two random numbers from a uniform distribution between 0 and 1
  # Use one random number to determine the time interval until the next reaction event
  # Use another random number to determine which reaction will occur next
  # Update the system by increasing the time by the time interval and changing the number of molecules according to the chosen reaction
# End loop
# Plot or output the results

An example implementation for a two-reaction system:

# Import libraries
import numpy as np
import matplotlib.pyplot as plt

# Define the system with two reactions: A -> B and B -> A
# The initial number of molecules for A and B are 10 and 0, respectively
# The reaction rate constants for A -> B and B -> A are 1.0 and 0.5, respectively
x_A = 10 # number of A molecules
x_B = 0 # number of B molecules
k_1 = 1.0 # rate constant for A -> B
k_2 = 0.5 # rate constant for B -> A

# Initialize the time to zero
t = 0

# Create empty lists to store the time and molecule values
t_list = []
x_A_list = []
x_B_list = []

# Loop until the end time of 10 is reached
while t < 10:
    # Calculate the total propensity function
    a_total = k_1 * x_A + k_2 * x_B
    
    # Generate two random numbers from a uniform distribution between 0 and 1
    r_1 = np.random.uniform(0,1)
    r_2 = np.random.uniform(0,1)
    
    # Use one random number to determine the time interval until the next reaction event
    tau = (1 / a_total) * np.log(1 / r_1)
    
    # Use another random number to determine which reaction will occur next
    if r_2 < (k_1 * x_A) / a_total:
        # A -> B occurs
        x_A -= 1 # decrease A by 1
        x_B += 1 # increase B by 1
    else:
        # B -> A occurs
        x_A += 1 # increase A by 1
        x_B -= 1 # decrease B by 1
    
    # Update the time by adding the time interval
    t += tau
    
    # Append the time and molecule values to the lists
    t_list.append(t)
    x_A_list.append(x_A)
    x_B_list.append(x_B)

# Plot or output the results
plt.plot(t_list, x_A_list, label="A")
plt.plot(t_list, x_B_list, label="B")
plt.xlabel("Time")
plt.ylabel("Number of molecules")
plt.title("Gillespie algorithm for a simple chemical system")
plt.legend()
plt.show()

(Figure: molecule counts of A and B versus time.)

Guarantee of this algorithm: the Gillespie algorithm generates statistically exact sample trajectories of the chemical master equation; it is not an approximation.

Take-home message:

Stochastic reaction networks with M species (each with count up to N)

\sum_{i=1}^M r_{ki}X_i \xrightarrow{c_k} \sum_{i=1}^M p_{ki}X_i

We have the master equation

\partial_t P_t(\bold{n}) = \sum_{k=1}^K [a_k(\bold{n} - s_k)P_t(\bold{n} - s_k) - a_k(\bold{n})P_t(\bold{n})]

Then the evolution of the system is determined by the matrices r and p together with the rate constants c_k. (The matrix r holds the reactant counts r_{ki}, p holds the product counts p_{ki}; their difference gives the state-change, i.e. stoichiometry, vectors s_k.)

s_k = p_k - r_k

The chemical transformation matrix \mathbb{T} is defined as

A numeric example:

\emptyset \xrightarrow{1} A,\qquad A \xrightarrow{1} \emptyset

We have

r = \begin{bmatrix} 0\\ 1 \end{bmatrix},\qquad p = \begin{bmatrix} 1\\ 0 \end{bmatrix},\qquad s = p - r = \begin{bmatrix} 1\\ -1 \end{bmatrix}

where the rows index the two reactions.
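Under the convention that the state-change vector is products minus reactants, this numeric example can be checked in a couple of lines (rows index reactions, columns index species; here there is one species, A):

```python
import numpy as np

# Reactant counts r and product counts p for the two reactions:
#   reaction 1:  0 -> A      reaction 2:  A -> 0
r = np.array([[0],
              [1]])
p = np.array([[1],
              [0]])
s = p - r  # state-change (stoichiometry) vectors
print(s)   # reaction 1 adds one A; reaction 2 removes one A
```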

The joint probability P_t(\bold{n}) can be difficult to handle, since \bold{n} (a vector of length M) ranges over N^M possible states.

We can propose a parameterization P_t^{\theta_t}(\bold{n}) of the joint probability, and then minimize the loss defined as

\mathcal{L}=D_\text{KL}(P_{t+\delta t}^{\theta_{t+\delta t}}(\bold{n})\,||\,\mathbb{T}P_t^{\theta_t}(\bold{n}))

where D_\text{KL} is the Kullback-Leibler divergence, P_{t+\delta t}^{\theta_{t+\delta t}}(\bold{n}) is the joint probability at time t+\delta t with parameters \theta_{t+\delta t}, P_t^{\theta_t}(\bold{n}) is the joint probability at time t with parameters \theta_t, and \mathbb{T} is the chemical transformation matrix.

The matrix \mathbb{T} has dimension N^M\times N^M, where N is the number of possible counts per species and M is the number of species.

The loss function is the Kullback-Leibler divergence between the joint probability at time t+\delta t and the joint probability at time t transformed by the chemical transformation matrix.

Why not implement the time tt as an embedding?

Tried; not promising, due to the negative-probability problem at long time steps.

Is this size big enough? Given N and M, the distribution has dimension N^M, where \bold{n} denotes the number state of the system.

e.g. \ket{1,0,0} means there is one molecule of species 1, and no molecules of species 2 and 3.

Learning nonequilibrium statistical mechanics and dynamical phase transitions

Ying Tang, International Academic Center of Complex Systems, Beijing Normal University

Non-equilibrium statistical mechanics is the study of systems that are not in thermodynamic equilibrium. Most systems found in nature are not in thermodynamic equilibrium: they are not in stationary states, and they are continuously subject to fluxes of matter and energy to and from other systems and their environment.

\frac{d}{dt}\ket{P_t} = \mathcal{W} \ket{P_t}

where \ket{P_t} is the probability distribution at time t, and \mathcal{W} is the generator of the Markov process.

\ket{P} = \sum_x P(x) \ket{x},\qquad \ket{x} = (x_1, x_2, \dots, x_n)

where \ket{P} is the probability distribution, P(x) is the probability of state x, and \ket{x} is the state vector.
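The generator formalism can be made concrete with a two-state process A ⇌ B (a sketch; the rates k1, k2 are hypothetical, and forward Euler stands in for the exact exponential of the generator):

```python
import numpy as np

# d|P_t>/dt = W |P_t> for a two-state Markov process A <-> B.
# Each column of W sums to zero, so total probability is conserved.
k1, k2 = 1.0, 0.5
W = np.array([[-k1,  k2],
              [ k1, -k2]])
P = np.array([1.0, 0.0])  # start with all probability in state A

dt, steps = 1e-3, 10_000  # integrate to t = 10 by forward Euler
for _ in range(steps):
    P = P + dt * (W @ P)

print(P)  # approaches the stationary distribution (k2, k1) / (k1 + k2)
```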

Kinetically constrained models (KCMs) are lattice models of interacting particles that are subject to constraints on their dynamics. They are used to model the dynamics of glassy systems.

Why can KCMs be used to model the dynamics of glassy systems?

They provide kinetic constraints that reproduce slow, glassy relaxation without any underlying thermodynamic transition.

Examples: the Fredrickson-Andersen (FA) model and the South-East model.

\mathcal{W}^{\text{KCM}} = \sum_i f_i (c\sigma_i^+ + (1-c)\sigma_i^- - c(1 - n_i) - (1-c)n_i)

where f_i is the kinetic constraint on site i, c is the density parameter, \sigma_i^+ and \sigma_i^- are the creation and annihilation operators on site i, and n_i is the occupation number of site i.

How to determine the way to flip the spin?

This can be done as follows: provided the kinetic constraint f_i is satisfied, an unoccupied site becomes occupied with rate c, and an occupied site becomes unoccupied with rate 1-c, consistent with the c\sigma_i^+ and (1-c)\sigma_i^- terms in the generator.
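A minimal Monte Carlo sweep implementing this flip rule for a 1D FA-type model (a sketch; the constraint f_i is assumed to require at least one occupied nearest neighbour, and L, c are hypothetical parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
L, c = 100, 0.3
n = (rng.random(L) < c).astype(int)  # initial occupations at density ~ c

for _ in range(L):  # one Monte Carlo sweep
    i = rng.integers(L)
    # kinetic constraint (assumed): a nearest neighbour must be occupied
    if n[(i - 1) % L] + n[(i + 1) % L] > 0:
        if n[i] == 0 and rng.random() < c:
            n[i] = 1   # 0 -> 1 with rate c
        elif n[i] == 1 and rng.random() < 1 - c:
            n[i] = 0   # 1 -> 0 with rate 1 - c

print(n.mean())  # density stays near c in equilibrium
```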

Difficulty in applying this method to 3D systems:

ML can be used to estimate the dynamic partition function:

\mathcal{Z}(t) = \sum_x e^{-\beta E(x)} P(x,t)

where \mathcal{Z}(t) is the dynamic partition function, E(x) is the energy of state x, and P(x,t) is the probability of state x at time t.

How to estimate the dynamic partition function?

We may use an autoregressive model to estimate the dynamic partition function:

\mathcal{Z}(t) = \sum_x e^{-\beta E(x)} P(x,t) \approx \sum_x e^{-\beta E(x)} \prod_{i=1}^n P(x_i|x_{<i},t)

where P(x_i|x_{<i},t) is the probability of the i-th spin given the previous spins at time t.
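The chain-rule factorization itself is easy to sanity-check on a toy two-spin joint distribution (illustrative numbers, not the lecturer's model):

```python
import numpy as np

# Toy joint distribution over two binary spins: P[x1, x2]
P = np.array([[0.1, 0.2],
              [0.3, 0.4]])

P_x1 = P.sum(axis=1)               # marginal P(x1)
P_x2_given_x1 = P / P_x1[:, None]  # conditional P(x2 | x1)

# Chain rule: the product of the factors recovers the joint exactly
P_rec = P_x1[:, None] * P_x2_given_x1
print(np.allclose(P_rec, P))  # True
```

An autoregressive network parameterizes exactly these conditional factors, one spin at a time.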

Dynamic partition function and observable

Consider the trajectory ensemble \omega_t = \{x_0 \to x_{t_1} \to \dots \to x_{t_I}\}.

The dynamical partition function is the moment generating function with the counting field s:

Z_t(s) = \sum_{\omega_t} e^{-\sum_{i=1}^I s_i E(x_{t_i})} P(\omega_t)

What exactly is the counting field s here? (left open in the notes)

According to the lecturer, this is the only work of his that uses neural networks to observe quantities that other methods cannot.

Tracking the distribution in the Ornstein-Uhlenbeck process

Stochastic differential equation:

\dot{x} = -kx + \sqrt{2D}\, \xi(t)

Fokker-Planck equation:

\frac{\partial P}{\partial t} = \partial_x[kx P(x,t)] + D\,\partial_x^2 P(x,t)

where P(x,t) is the probability density, t is the time, x is the position, k is the drift coefficient, D is the diffusion coefficient, and \xi(t) is Gaussian white noise.

How to solve it?

path integral approach

This is the Langevin equation of the Ornstein-Uhlenbeck process, and can be analytically solved:

P(x_N,t_N|x_0,t_0) = \sqrt{\frac{k}{2\pi D(1-e^{-2k(t_N-t_0)})}} e^{-\frac{k(x_N-x_0e^{-k(t_N-t_0)})^2}{2D(1-e^{-2k(t_N-t_0)})}}

where P(x_N,t_N|x_0,t_0) is the probability density of finding the system at position x_N at time t_N, given that it started at x_0 at time t_0.
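This Gaussian propagator can be checked against a direct Euler-Maruyama simulation (a sketch, using the convention dx = -kx dt + sqrt(2D) dW that matches the variance factors in the propagator; all parameter values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
k, D, x0, T, dt, n_traj = 1.0, 0.5, 2.0, 2.0, 1e-3, 20_000
steps = int(T / dt)

x = np.full(n_traj, x0)
for _ in range(steps):
    # Euler-Maruyama step for dx = -k x dt + sqrt(2D) dW
    x += -k * x * dt + np.sqrt(2 * D * dt) * rng.standard_normal(n_traj)

mean_th = x0 * np.exp(-k * T)                 # analytic mean
var_th = (D / k) * (1 - np.exp(-2 * k * T))   # analytic variance
print(x.mean(), mean_th)
print(x.var(), var_th)
```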

How is this connected to stochastic gradient descent?

SGD works as follows:

x_{n+1} = x_n - \eta \nabla f(x_n) + \sqrt{2\eta D}\, \xi_n

In the continuum limit, we have

\dot{x} = -\nabla f(x) + \sqrt{2D}\, \xi(t)

Using a flow model to solve the Fokker-Planck equation: not introduced in detail (work in progress).

Machine Learning, Statistical Physics, and Complex Systems

Xiaosong Chen, YNU, 2023-08-01 13:08:31

Apply it to some toy models to see whether this black-box model is explainable.

Long-range connected 2-D network percolation

This model exhibits a phase transition from a non-percolating phase to a percolating phase.

Universality class: near the critical point, the physical properties of a system are universal, and physical quantities follow power laws.

Definition of a universality class: all models within one universality class share the same critical exponents.

Critical phenomena of percolation:

P(p)\propto \begin{cases} (p-p_c)^\beta & p\to p_c^+\\ 0 & p\le p_c \end{cases}

\chi(p)\propto |p-p_c|^{-\gamma}

Scale invariance: near the critical point, the physical properties of the system are scale invariant, and physical quantities follow power laws.

ML can learn the percolation phase transition by supervised learning: the input is a 2-D network configuration, and the output is the probability that it percolates.
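The order parameter P(p) can also be estimated by direct simulation, which is how such training labels are generated. A minimal site-percolation sketch on an L × L lattice, where "percolating" is taken to mean an occupied cluster spans from the top row to the bottom row (lattice size and densities are hypothetical):

```python
import numpy as np
from collections import deque

def percolates(occ):
    """True if an occupied cluster connects the top row to the bottom row."""
    L = occ.shape[0]
    seen = np.zeros_like(occ, dtype=bool)
    q = deque()
    for j in range(L):
        if occ[0, j]:
            seen[0, j] = True
            q.append((0, j))
    while q:
        i, j = q.popleft()
        if i == L - 1:
            return True
        for a, b in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
            if 0 <= a < L and 0 <= b < L and occ[a, b] and not seen[a, b]:
                seen[a, b] = True
                q.append((a, b))
    return False

rng = np.random.default_rng(0)
L, trials = 32, 200
fracs = {p: np.mean([percolates(rng.random((L, L)) < p) for _ in range(trials)])
         for p in (0.4, 0.7)}  # below and above p_c ~ 0.5927 (site percolation)
print(fracs)  # spanning fraction jumps from ~0 to ~1 across p_c
```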

Different percolation models:

Difference: bond percolation is more difficult to simulate.

Finding:

What the hell is this all about, though, since it provides no theoretical insight or rigorous proof of these claims?

El Niño prediction

A completely data driven approach

Open the blackbox, what can we do?

The key is to represent, not solve.

Play around with the dataset.

Weather differs from region to region. Have you considered that difference?

This is global data, not limited to a single region.

Have you considered long-term prediction?

Data is the challenge: reliable weather records only cover the past few decades.

Well, weather prediction is now working with CNNs.

Variational Bayesian Method

Derivation of the ELBO via Jensen's inequality:

\ln p(x_0) = \ln \int p(x_0, z_0)\, dz_0 = \ln \int q(z_0) \frac{p(x_0, z_0)}{q(z_0)} dz_0 \ge \int q(z_0) \ln \frac{p(x_0, z_0)}{q(z_0)} dz_0 = \mathcal{L}(q)

where p(x_0) is the marginal likelihood (evidence), p(x_0, z_0) is the joint likelihood, q(z_0) is the variational distribution, and \mathcal{L}(q) is the evidence lower bound (ELBO); the inequality is Jensen's inequality applied to the concave logarithm.
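The bound can be verified numerically with a toy discrete latent variable (the numbers are hypothetical):

```python
import numpy as np

# Toy model: discrete latent z in {0, 1}, one fixed observation x0
p_joint = np.array([0.1, 0.3])  # p(x0, z) for z = 0 and z = 1
q = np.array([0.5, 0.5])        # variational distribution q(z)

log_evidence = np.log(p_joint.sum())    # ln p(x0)
elbo = np.sum(q * np.log(p_joint / q))  # E_q[ln p(x0, z) - ln q(z)]
print(log_evidence, elbo)  # the ELBO lies below the log-evidence
```

The gap between the two numbers is exactly D_KL(q || p(z|x_0)), which vanishes when q matches the true posterior.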

Fluctuation theorem

Fluctuation theorem is a theorem in statistical mechanics that describes the probability distribution of the time-averaged irreversible entropy production of a system that is arbitrarily far from equilibrium.

\frac{P(\sigma)}{P(-\sigma)} = e^{\sigma}

or

\frac{p_0[W_0;\lambda]}{p_0[-W_0;\epsilon_Q\lambda]} = e^{\beta W_0}

where P(\sigma) is the probability distribution of the time-averaged irreversible entropy production \sigma, p_0[W_0;\lambda] is the probability distribution of the work W_0 done on the system under the protocol \lambda, \epsilon_Q\lambda denotes the time-reversed protocol, and \beta is the inverse temperature.
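A quick numeric check in a standard special case: if the entropy production is Gaussian with variance equal to twice its mean, the ratio P(σ)/P(−σ) = e^σ holds exactly (the mean value below is hypothetical):

```python
import numpy as np

m = 1.5        # mean entropy production (hypothetical value)
var = 2 * m    # Gaussian special case: variance = 2 * mean
sigma = np.linspace(-3.0, 3.0, 13)

def logP(s):
    # unnormalized Gaussian log-density of entropy production
    return -(s - m) ** 2 / (2 * var)

ratio = np.exp(logP(sigma) - logP(-sigma))
print(np.allclose(ratio, np.exp(sigma)))  # True
```

The normalization constants cancel in the ratio, so only the quadratic exponents matter.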

Diffusion Model

(Jiaming Song et al., ICLR 2021) Denoising Diffusion Implicit Models

🍀 Jianlin Su (engineer, ZhuiYi AI) provided an in-depth derivation of DDIM. Yang Song (PhD, Stanford) introduced the connection between diffusion models and score-based models.

How can a generative model interpolate between two images?

An intuitive way: add noise to both images again, and interpolate in noise space.

Guided diffusion: ?

How does a GAN interpolate?

Three-color ink in a non-Newtonian liquid

AI4Materials

Element table + spatial group = Crystal structure

Find material structures with high free energy and good thermoelectric performance.

Boids algorithm: birds as agents following three rules, namely separation, alignment, and cohesion.
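A minimal one-step Boids sketch (not from the lecture; the separation/alignment/cohesion weights and the interaction radius are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 30
pos = rng.random((N, 2))                  # positions in the unit square
vel = rng.standard_normal((N, 2)) * 0.01  # small random velocities

def boids_step(pos, vel, w_coh=0.01, w_ali=0.05, w_sep=0.05, r_sep=0.05):
    """One update applying cohesion, alignment, and separation."""
    new_vel = vel.copy()
    for i in range(len(pos)):
        others = np.arange(len(pos)) != i
        new_vel[i] += w_coh * (pos[others].mean(axis=0) - pos[i])  # cohesion
        new_vel[i] += w_ali * (vel[others].mean(axis=0) - vel[i])  # alignment
        d = pos[i] - pos[others]
        close = np.linalg.norm(d, axis=1) < r_sep
        if close.any():
            new_vel[i] += w_sep * d[close].sum(axis=0)             # separation
    return pos + new_vel, new_vel

pos, vel = boids_step(pos, vel)
print(pos.shape, vel.shape)
```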

The problem of new material design lies in the ability to grow that material.

Synthesis is another process that can be reformulated as an AI problem.

Automation is the key

A new material that does not absorb solar radiation but radiates heat to the cold universe background, which lowers its temperature below ambient (radiative cooling).

Crowdsourcing platforms: similar to the ones for protein design and protein folding.

Nature is the best teacher

"Go for the mess".

Close speech

Find a promising direction.

Iteration: iterating is the key to building something great. Instead of starting everything from scratch, we had better build on top of existing attempts.

Some quotes from Sam Altman (Founder/CEO of OpenAI):